Abstract




We unify f-divergences, Bregman divergences, surrogate loss bounds (regret bounds),
proper scoring rules, matching losses, cost curves, ROC curves and information. We do
this by systematically studying integral and variational representations of these objects
and in so doing identify their primitives, which are all related to cost-sensitive binary
classification. As well as clarifying relationships between generative and discriminative
views of learning, the new machinery leads to tight and more general surrogate loss
bounds and generalised Pinsker inequalities relating f-divergences to variational
divergence. The new viewpoint illuminates existing algorithms: it provides a new derivation
of Support Vector Machines in terms of divergences and relates Maximum Mean
Discrepancy to Fisher Linear Discriminants. It also suggests new techniques for estimating
f-divergences.

1. Introduction 

Machine learning problems often concern binary experiments. There it is assumed that 
observations are drawn from a mixture of two distributions (one for each class). These 
distributions determine many important objects related to the learning problems they 
underpin such as risk, divergence and information. Our aim in this paper is to present 
all of these objects in a coherent framework explaining exactly how they relate to each 
other.

1.1 Motivation 

There are many different notions that underpin the definition of machine learning prob- 
lems. These include information, loss, risk, regret, ROC curves and the area under 
them, matching loss functions, Bregman divergences and distance or divergence between 
probability distributions. On the surface, the problem of estimating whether two distri- 
butions are the same (as measured by, say, their Kullback-Leibler divergence) is different 
to the minimisation of expected risk in a prediction problem. One of the purposes of
the present paper is to show that this difference is indeed only superficial:
deeper down they are the same problem, and analytical and algorithmic insights for one
can be transferred to the other.

Machine learning as an engineering discipline is still in its infancy 1 . There is no agreed
language to describe machine learning problems (this is usually done with an informal
mixture of English and mathematics). There is very little in the way of composability
of machine learning solutions; that is, given the solution to one problem, one cannot
readily use it to solve another. Of course one would like to not merely be able to do this, but to be certain
what one might lose in doing so. In order to do that, one needs to be able to provide 
theoretical guarantees on how well the original problem will be solved by solving the 
surrogate problem. Related to these issues is the fact that there are no well understood 
primitives for machine learning. Indeed, what does that even mean? All of these issues 
are the underlying motivation for this paper. 

Our long term goal (towards which this paper is but the first step) is to turn the 
field of machine learning into a more well founded engineering discipline with an agreed 
language and well understood composition rules. Our motivation is that until one can 
start building systems modularly, one is largely restricted to starting from scratch for 
each new problem, rather than obtaining the efficiency benefits of re-use 2 . 

We are comparing problems, not solutions or algorithms. Whilst there have been 
attempts to provide a degree of unification at the level of algorithms (Altun and Smola, 
2006), there are intrinsic limits to such a research program. The most fundamental is 
that (surprisingly!) there is no agreed formal definition of what an algorithm really is, 
nor how two algorithms can be compared with a view to determining if they are the 
same (Blass and Gurevich, 2003). 

We have started with binary experiments because they are simple and widely used. 
As we will show, by pursuing the high level research agenda summarised above, we 
have managed to unify all of the disparate concepts mentioned and furthermore have 
simultaneously simplified and generalised two fundamental results: Pinsker inequalities 
between f-divergences and surrogate-loss regret bounds. The proofs of these new results
rely essentially on the decomposition into primitive problems. 



1. Bousquet (2006) has articulated the need for an agreed vocabulary, a clear statement of the main 
problems, and to "revisit what has been done or discovered so far with a fresh look". 

2. Abelson et al. (1996) described the principles of constructing software with the aid of (Locke, 1690, 
Chapter 12, paragraph 1): 

The acts of the mind, wherein it exerts its power over simple ideas, are chiefly these three: 
(1) Combining several simple ideas into one compound one; and thus all complex ideas are 
made. (2) The second is bringing two ideas, whether simple or complex, together, and 
setting them by one another, so as to take a view of them at once, without uniting them 
into one; by which it gets all its ideas of relations. (3) The third is separating them from 
all other ideas that accompany them in their real existence; this is called abstraction: and 
thus all its general ideas are made 

Modularity is central to computer hardware (Baldwin and Clark, forthcoming, 2006) and other 
engineering disciplines (Gershenson et al., 2003). 
1.2 Novelty and Significance

Our initial goal was to present existing material in a unified way. We have indeed done 
that. In doing so we have developed new (and simpler) proofs of existing results.
Additionally we have developed some novel technical results: 1) a link between the weighted
integral representations for proper scoring rules and those for f-divergences; 2) a unified
derivation of the integral representations in terms of Taylor series; 3) use of these
representations to derive new bounds for divergences, Bayes risks and regrets ("surrogate loss
bounds" and Pinsker inequalities); 4) showing that statistical information (and hence
f-divergence) are both Bregman informations; 5) showing connections between variational
representation of risks and divergences; 6) the derivation of SVMs from a variational 
perspective; 7) results relating AUC (Area under the ROC Curve) to divergences. 

The significance of these new connections is that they show that the choice of loss
function (scoring rule), f-divergence and Bregman divergence (regret) are intimately
related: choosing one implies choices for the others. Furthermore we show there are
more intuitively usable parameterisations for f-divergences and scoring rules (their
corresponding weight functions). The weight functions have the advantage that if two weight
functions match, then the corresponding objects are identical. That is not the case
for the f parametrising an f-divergence or the convex function parametrising a Bregman
divergence. As well as the theoretical interest in such connections, these alternate
representations suggest new algorithms for empirically estimating such quantities. 

1.3 Background 

Specific results are referred to in the body of the paper. We briefly indicate the broad 
sweep of prior work along the lines of the present paper. 

The most important precursors and inspiration are the three nearly simultaneous 3 
works by Buja et al. (2005), Liese and Vajda (2006) and Nguyen et al. (2005). The work 
by Dawid (2007) is very similar in spirit to that presented here. A crucial difference is 
that he relies on a parametric viewpoint, and can utilise the machinery of Riemannian 
geometry 4 . All of the results in the present paper are, in contrast, "coordinate-free." 
The motivation of the present work is closely aligned with that of Hand (1994) whose 
avowed aim was to "stimulate debate about the need to formulate research questions 
sufficiently precisely that they may be unambiguously and correctly matched with sta- 
tistical techniques." 5 

The paper presents a unification of sorts. This, in itself, is hardly new in machine 
learning. There are different approaches to unification. One distinction is between 
Monistic and Pluralistic approaches (James, 1909; Turkle and Papert, 1992). 

3. (Nguyen et al., 2005) is dated 13 October, 2005, (Liese and Vajda, 2006) was received on 26 October 
2005 and (Buja et al., 2005) is dated 3 November 2005. Shen's PhD thesis (Shen, 2005), which 
contains most of the material in (Buja et al., 2005), is dated 16 October 2005. 

4. Zhang (2004a); Zhang and Matsuzoe (2008) have developed a number of connections between convex 
functions, the Bregman divergences they induce, and Riemannian geometry. 

5. Hand and Vinciotti (2003) develop some refined machine learning tasks that can be viewed as 
weighted problems; confer Buja et al. (2005). 
Monistic approaches aim for a single all-encompassing theory 6 . A problem with
most monistic approaches is that one has to accept them "all or nothing." There are many
unifying approaches developed in Statistics and Machine Learning that have left little
trace 7 .

Pluralistic approaches are closer to what is proposed here (where, instead of search- 
ing for a single master representation, we study relationships and translations between 
a range of different representations). It resonates with Kiefer's assertion that "Statis- 
tics is too complex to be codified in terms of a simple prescription that is a panacea 
for all settings, and . . . one must look as carefully as possible at a variety of possible 
procedures. . . " (Kiefer, 1977). Examples of existing pluralistic attempts include limited 
problem catalogs such as for different notions of cost (Turney, 2000) or a restricted set 
of problems (Raudys, 2001). 

The decision theoretic approach (DeGroot, 1970; Berger, 1985; Kiefer, 1987) due to 
Wald (1950, 1949) is central to the present paper. The idea of seeking primitives for 
statistics dates back at least to the elementary experiments of Birnbaum (1961). The 
relationship between risks and Bregman divergences is studied by Grünwald and Dawid
(2004); Buja et al. (2005). Summaries of earlier work on surrogate regret bounds and 
Pinsker bounds are given in Appendices C and D respectively. 

There are numerous possible definitions of information. Many of them are sterile; 
Csiszár (1978) and Aczél (1984) provide a critical analysis. Floridi (2004) discusses
pluralistic versus monistic approaches: is there one single definition of information, or
should there be many different definitions depending on the particular problem? Our
view, like that of Shannon (1948), is that there are many types. Shannon information was
developed with communications problems in mind — there is no fundamental reason 
why it is the only notion of information that makes sense for learning and inference. 

There are many known relationships between risks and divergences between distributions,
many of which we explicitly discuss later in the paper 8 . The idea of solving
a machine learning problem by using a solution to some other learning problem is now 
called machine learning reductions (Beygelzimer et al., 2008, 2005) 9 . Two key differences 

6. Monistic approaches can be categorised into at least four distinct categories. They are briefly sum- 
marised in Appendix B. 

7. For example: Nelson's use of non-standard analysis (Nelson, 1987; Lutz and Musio, 2005) as the 
foundations for probability; Topsøe's (Topsøe, 2006), Shafer and Vovk's (Shafer and Vovk, 2001)
game theory as a basis, and Le Cam's use of Riesz measures on a vector lattice to replace the 
traditional sample space (LeCam, 1964). 

8. General results include those due to Österreicher (2003); Österreicher and Vajda (1993);
Gutenbrunner (1990); Liese and Vajda (2006); Goel and DeGroot (1979); Golic (1987). Particular relations
between risk in binary classification problems and f-divergences are not new (Poor and Thomas,
1977; Kailath, 1967). Some more general results that relate the choice of loss function in a binary
learning problem to particular f-divergences between the class-conditional distributions have been
(re)-discovered (Eguchi and Copas, 2001; Nguyen et al., 2005; Österreicher and Vajda, 1993). Known
results relating different distances between probability distributions are summarised by Gibbs and
Su (2002).

9. The idea is not new. Equivalences are a natural structuring device and were explicit in Ashby's 
foundational work on cybernetics (Ashby, 1956), a precursor to Machine Learning. Ben-Bassat 
(1978) studied the concept of ε-equivalence, Conover and Iman (1981) showed how rank tests can be
between the recent machine learning reductions literature and the present paper are that
our relationships between problems are (usually) exact (instead of approximate) and we 
work with the true underlying distributions (rather than finite sample distributions). 
The theory of Comparison of Experiments, developed by Blackwell (1951, 1953), and 
significantly extended by LeCam (1964, 1986) is also related to the overall goal set out 
here 10 . 

Graphical representations have been used for a long while to better understand binary 
experiments 11 . These can be seen as representations of Binary Experiments. 

1.4 Outline 

The following is an outline of the main structure of this paper. 

Many of the properties of the quantities studied in this paper are directly derived 
from well-known properties of convex functions. In particular, a generalised form of 
Taylor's theorem and Jensen's inequality underpin many of the results in this paper. 

One of the simplest types of statistical problems is distinguishing between two
distributions. Such a problem is known as a binary experiment. Two classes of measures of
divergence between the distributions are introduced: the class of Csiszár f-divergences
and the class of Bregman divergences. 

When additional assumptions are made about a binary experiment — specifically, a 
prior probability for each of the two distributions — it becomes possible to talk about 
risk and statistical information of an experiment that is defined with respect to a loss 
function. 



derived by applying nonparametric tests to order statistics, and Goldman et al. (1989); Bartlett et al. 
(1996) used reductions for theoretical purposes. However recently there has been a large number 
of explicit constructions of reductions (Zadrozny et al., 2003; Langford, 2006; Beygelzimer et al.,
2005; Langford and Beygelzimer, 2005; Langford and Zadrozny, 2005; Langford et al., 2006; Li and 
Lin, 2006; Beygelzimer et al., 2007; Langford, 2007; Scott and Davenport, 2007), or development of 
results which although not explicitly called reductions are effectively so (Brown et al., 2002; Brown 
and Low, 1996; Brown and Zhao, 2003; Chaudhuri and Loh, 2002; Cossock and Zhang, 2006; Cuevas 
and Fraiman, 1997; Domingos, 1999; Steinwart et al., 2005; Tasche, 2001). 

10. It has been used to define notions of isomorphism for statistical problem settings (Morse and Sack- 
steder, 1966; Sacksteder, 1967) and is the subject of three books (Strasser, 1985; Torgersen, 1991; 
Heyer, 1982) and a recent review (Goel and Ginebra, 2003). The key difference with the present 
work is that the comparison of experiments theory seeks results that hold for all loss functions rather 
than for a particular one; with a few exceptions (Torgersen, 1991, Chapter 10). Blackwell related 
comparisons to sufficient statistics and characterised comparisons. LeCam (1964) quantified com- 
parisons in terms of the degree to which one experiment is "better than" another (the deficiency 
distance). There are very few known examples of deficiency distance (Carter, 2002). Furthermore 
LeCam's theory is formulated in a particularly abstract way to make its theorems elegant (Yang and 
Le Cam, 1999). Renowned probabilists concur that its arcane formulation has made it inaccessible 
(van der Vaart, 2002; Pollard, 2000; Strasser, 2000). Consequently the subject has had relatively 
limited impact. 

11. In this paper we draw connections between Receiver Operating Characteristic (ROC) curves
(Fawcett, 2006, 2004; Flach, 2003; Flach and Wu, 2005; Maxion and Roberts, 2004), the Area Under
ROC Curve (AUC) (Cortes and Mohri, 2004; Hand, 2008; Hand and Till, 2001; Hanley and McNeil,
1982) and Cost Curves (Drummond and Holte, 2006; Torgersen, 1991).
First, we present a number of results connecting risk, statistical information,
f-divergence, and Bregman divergence and information that are scattered about the
literature. These results show that all of these concepts are intimately related. Second, we
exploit a result that shows that proper scoring rules — a natural class of losses for proba- 
bility estimation — have a Choquet representation; i.e. they are expressible as weighted 
integrals of a family of "primitive" losses, namely the cost-weighted misclassification 
losses. 

By combining this characterisation of proper scoring rules with the results relating 
risk, information and divergence we are able to identify similar primitives and weighted 
integral representations for f-divergences, statistical information, Bregman divergences,
and Bregman information. These representations simplify the study of these concepts by 
identifying each with its corresponding weight function. These weight functions elucidate 
several properties of the risks, divergences and informations they characterise, including 
their optimality and their convexity or concavity. We provide a "translation" between 
weight functions that clarifies the relationships between these concepts. The weight 
function view also illuminates various "graphical representations" of binary experiments, 
such as ROC curves. 

Finally, we present two insights obtained from this unification. The first is a template
for deriving Pinsker-like bounds on arbitrary f-divergences in terms of variational
divergence and surrogate loss bounds which bound the regret of a hypothesis under
an arbitrary scoring rule in terms of its regret under the cost-sensitive misclassification
loss. The bounds we derive are more general than those previously presented. The
second insight concerns the apparent difference between the Bayes risk (which involves
an optimization) and the f-divergence (which does not). Both of these are equivalent in
ways we show. Thus we consider "variational" approaches to divergences. One specific 
consequence of this is that maximum mean discrepancy (MMD) — a kernel approach to 
hypothesis testing and divergence estimation — is essentially SVM learning in disguise. 

1.5 Notational Conventions 

The substantive objects are defined within the body of the paper. Here we collect 
elementary notation and the conventions we adopt throughout. We write $x \wedge y := \min(x, y)$,
$x \vee y := \max(x, y)$, $(x)_+ := x \vee 0$, $(x)_- := x \wedge 0$ and $[p] = 1$ if $p$ is true
and $[p] = 0$ otherwise. The generalised function $\delta(\cdot)$ is defined by $\int_a^b \delta(x) f(x)\,dx = f(0)$
when $f$ is continuous at $0$ and $a < 0 < b$. The unit step is $U(x) = \int_{-\infty}^{x} \delta(t)\,dt$. The real
numbers are denoted $\mathbb{R}$, the non-negative reals $\mathbb{R}_+$ and the extended reals $\overline{\mathbb{R}} = \mathbb{R} \cup \{\infty\}$;
the rules of arithmetic with extended real numbers and the need for them in convex
analysis are explained by Rockafellar (1970). Random variables are written in sans-serif
font: $\mathsf{S}, \mathsf{X}, \mathsf{Y}$. Sets are in calligraphic font: $\mathcal{X}$ (the "input" space), $\mathcal{Y}$ (the "label"
space). Vectors are written in bold font: $\mathbf{a}, \boldsymbol{\alpha}, \mathbf{x} \in \mathbb{R}^m$. We will often have cause to take
expectations ($\mathbb{E}$) over the random variable $\mathsf{X}$. We write such quantities in blackboard
bold: $\mathbb{I}, \mathbb{L}, \mathbb{B}, \mathbb{J}$ etc. The elementary loss is $\ell$, its conditional expectation w.r.t. $\mathsf{Y}$ is $L$
and the full expectation (over the joint distribution $\mathbb{P}$ of $(\mathsf{X}, \mathsf{Y})$) is $\mathbb{L}$. Lower bounds
on quantities with an intrinsic lower bound (e.g. the Bayes optimal loss) are written
with an underbar: $\underline{L}, \underline{\mathbb{L}}$. Quantities related by double integration recur in this paper
and we notate the starting point in lower case, the first integral with upper case, and
the second integral in upper case with an overbar: $w, W, \overline{W}$. Estimated quantities are
hatted: $\hat{\eta}$. In several places we overload the notation. In all cases careful attention to
the type of the arguments or subscripts reliably disambiguates.



2. Convex functions and their representations 

Many of the properties of divergences and losses are best understood through properties 
of the convex functions that define them. One aim of this paper is to explain and relate 
various divergences and losses by understanding the relationships between their primitive 
functions. The relevant definitions and theory of convex functions will be introduced as 
required. Any terms not explicitly defined can be found in books by Hiriart-Urruty and 
Lemaréchal (2001) or Rockafellar (1970).

A set $\mathcal{S} \subseteq \mathbb{R}^d$ is said to be convex if it is closed under linear interpolation. That is,
for all $\lambda \in [0, 1]$ and for all points $s_1, s_2 \in \mathcal{S}$ the point $\lambda s_1 + (1 - \lambda) s_2 \in \mathcal{S}$. A function
$\phi : \mathcal{S} \to \mathbb{R}$ defined on a convex set $\mathcal{S}$ is said to be a (proper) convex function if all lines
between points on the graph of $\phi$ never lie below $\phi$. 12 That is, for all $\lambda \in [0, 1]$ and
points $s_1, s_2 \in \mathcal{S}$ the function $\phi$ satisfies

$$\phi(\lambda s_1 + (1 - \lambda) s_2) \le \lambda \phi(s_1) + (1 - \lambda) \phi(s_2).$$

A function is said to be concave if its additive inverse is convex. That is, $\phi : \mathcal{S} \to \mathbb{R}$ is
concave if $-\phi$ is convex.

The remainder of this section presents properties, representations and transforma- 
tions of convex functions that will be used throughout this paper. 



2.1 The Perspective Transform and the Csiszar Dual 

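When $\mathcal{S} = \mathbb{R}_+$ we can define a transformation of a convex function $\phi : \mathbb{R}_+ \to \mathbb{R}$ called
the perspective transform of $\phi$, denoted $I_\phi$ and defined for $\tau \in \mathbb{R}_+$ by

$$I_\phi(s, \tau) := \begin{cases} \tau\,\phi(s/\tau), & \tau > 0,\ s > 0 \\ 0, & \tau = 0,\ s = 0 \\ \tau\,\phi(0), & \tau > 0,\ s = 0 \\ s\,\phi'_\infty, & \tau = 0,\ s > 0 \end{cases} \tag{1}$$

where $\phi(0) := \lim_{s \to 0} \phi(s) \in \overline{\mathbb{R}}$ and $\phi'_\infty$ is the slope at infinity defined as

$$\phi'_\infty := \lim_{s \to +\infty} \frac{\phi(s_0 + s) - \phi(s_0)}{s} = \lim_{s \to +\infty} \frac{\phi(s)}{s} \tag{2}$$

for every $s_0 \in \mathcal{S}$ where $\phi(s_0)$ is finite. This slope at infinity is only finite when $\phi(s) = O(s)$,
that is, when $\phi$ grows at most linearly as $s$ increases. When $\phi'_\infty$ is finite it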



12. The restriction of the values of $\phi$ to $\mathbb{R}$ will be assumed throughout unless explicitly stated otherwise.
This implies the properness of $\phi$ since it cannot take on the values $-\infty$ or $+\infty$.
measures the slope of the linear asymptote. The function $I_\phi : [0, \infty)^2 \to \overline{\mathbb{R}}$ is convex in
both arguments (Hiriart-Urruty and Lemaréchal, 1993b) and may take on the value $+\infty$
when $s$ or $\tau$ is zero. It is introduced here as it will form the basis of the f-divergences
described in the next section. 13

The perspective transform can be used to define the Csiszár dual $\phi^\diamond : [0, \infty) \to \overline{\mathbb{R}}$ of
a convex function $\phi : \mathbb{R}_+ \to \mathbb{R}$ by letting

$$\phi^\diamond(\tau) := I_\phi(1, \tau) = \tau\,\phi\!\left(\frac{1}{\tau}\right) \tag{3}$$

for all $\tau \in \mathbb{R}_+$ with $\tau > 0$, and $\phi^\diamond(0) := \phi'_\infty$. Note that the original $\phi$ can be recovered from $I_\phi$ as
$\phi(s) = I_\phi(s, 1)$.

The convexity of the perspective transform $I_\phi$ in both its arguments guarantees
the convexity of the dual $\phi^\diamond$. Some simple algebraic manipulation shows that for all
$s, \tau \in \mathbb{R}_+$

$$I_\phi(s, \tau) = I_{\phi^\diamond}(\tau, s). \tag{4}$$

This observation leads to a natural definition of symmetry for convex functions. We
will call a convex function $\diamond$-symmetric (or simply symmetric when the context is clear)
when its perspective transform is symmetric in its arguments. That is, $\phi$ is $\diamond$-symmetric
when $I_\phi(s, \tau) = I_\phi(\tau, s)$ for all $s, \tau \in [0, \infty)$. Equivalently, $\phi$ is $\diamond$-symmetric if and only if
$\phi^\diamond = \phi$.
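The case analysis in (1), the dual in (3) and the identity (4) can be checked numerically. The following is a minimal sketch, not part of the original development: the function names `perspective` and `csiszar_dual`, and the convention of passing the two limits $\phi(0)$ and $\phi'_\infty$ as explicit arguments, are assumptions made for illustration. It uses $\phi(t) = t \log t$, whose dual is $-\log \tau$, and $\phi(t) = |t - 1|$, which appears to be $\diamond$-symmetric.

```python
import math


def perspective(phi, s, tau, phi_at_zero, slope_at_infinity):
    """Perspective transform I_phi(s, tau), following the four cases of (1).

    phi_at_zero and slope_at_infinity are supplied by the caller (an assumption
    of this sketch) because they are limits that cannot be evaluated pointwise.
    """
    if tau > 0 and s > 0:
        return tau * phi(s / tau)
    if tau == 0 and s == 0:
        return 0.0
    if tau > 0 and s == 0:
        return tau * phi_at_zero
    return s * slope_at_infinity  # remaining case: tau == 0, s > 0


def csiszar_dual(phi):
    """Csiszar dual for tau > 0, as in (3): tau * phi(1 / tau)."""
    return lambda tau: tau * phi(1.0 / tau)


def kl(t):
    """phi(t) = t log t, with phi(0) = 0 and infinite slope at infinity."""
    return t * math.log(t)


def tv(t):
    """phi(t) = |t - 1|."""
    return abs(t - 1.0)


kl_dual = csiszar_dual(kl)  # equals -log(tau); its value at 0 is +inf, its slope at infinity is 0
tv_dual = csiszar_dual(tv)

# Identity (4) at an interior point: I_phi(s, tau) = I_{phi dual}(tau, s).
lhs = perspective(kl, 0.3, 0.7, 0.0, math.inf)
rhs = perspective(kl_dual, 0.7, 0.3, math.inf, 0.0)
```

Here `lhs` and `rhs` should agree up to rounding, and `tv_dual(t)` should match `tv(t)` for any `t > 0`.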



2.2 The Legendre-Fenchel Dual Representation 

A second important dual operator for convex functions is the Legendre-Fenchel (LF)
dual. The LF dual $\phi^*$ of a function $\phi : \mathcal{S} \to \mathbb{R}$ is a function defined by

$$\phi^*(s^*) := \sup_{s \in \mathcal{S}} \{ \langle s, s^* \rangle - \phi(s) \}. \tag{5}$$

The LF dual of any function is convex and, if the function $\phi$ is convex then the LF bidual
is a faithful representation of the original function. That is,

$$\phi^{**}(s) = \sup_{s^* \in \mathcal{S}^*} \{ \langle s^*, s \rangle - \phi^*(s^*) \} = \phi(s). \tag{6}$$

When $\phi(s)$ is a function of a real argument $s$ and the derivative $\phi'(s)$ exists, the
Legendre-Fenchel conjugate $\phi^*$ is given by the Legendre transform (Hiriart-Urruty and
Lemaréchal, 2001; Rockafellar, 1970)

$$\phi^*(s) = s \cdot (\phi')^{-1}(s) - \phi\big((\phi')^{-1}(s)\big). \tag{7}$$
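As an illustration of (5) and (7), the supremum can be approximated by a maximum over a finite grid and compared with the closed form from the Legendre transform. This is a rough sketch under stated assumptions: the helper names, the choice $\phi = \exp$, and the grid over $[-10, 10]$ are all hypothetical and chosen only for this example. For $\phi = \exp$ we have $(\phi')^{-1}(s^*) = \log s^*$, so (7) gives $\phi^*(s^*) = s^* \log s^* - s^*$.

```python
import math


def lf_conjugate_grid(phi, s_star, grid):
    """Approximate phi*(s*) = sup_s { s * s* - phi(s) } by a maximum over a finite grid."""
    return max(s * s_star - phi(s) for s in grid)


def legendre_exp(s_star):
    """Closed form from (7) for phi = exp, valid for s* > 0."""
    inverse_derivative = math.log(s_star)  # (phi')^{-1}(s*)
    return s_star * inverse_derivative - math.exp(inverse_derivative)


# Grid over [-10, 10] with step 0.001; a finer grid tightens the approximation.
grid = [-10 + 0.001 * k for k in range(20001)]
approx = lf_conjugate_grid(math.exp, 2.0, grid)
exact = legendre_exp(2.0)
```

Because the grid maximum is taken over a subset of the domain, `approx` can only underestimate the true supremum; the gap shrinks as the grid is refined.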

13. The perspective transform is closely related to epi-multiplication which is defined for all $\tau \in [0, \infty)$
and (proper) convex functions $\phi$ to be $\tau \otimes \phi := s \mapsto \tau\phi(s/\tau)$ for $\tau > 0$ and is $0$ when $\tau = s = 0$
and $+\infty$ otherwise. Bauschke et al. (2008) provides an excellent summary of the properties of this
operation along with its relationship to other operations on convex functions.
2.3 Integral Representations

In this paper we are primarily concerned with convex and concave functions defined on 
subsets of the real line. A central tool in their analysis is the integral form of their Taylor 
expansion. Here, $\phi'$ and $\phi''$ denote the first and second derivatives of $\phi$ respectively.

Theorem 1 (Taylor's Theorem) Let $\mathcal{S} = [s_0, s]$ be a closed interval of $\mathbb{R}$ and let
$\phi : \mathcal{S} \to \mathbb{R}$ be differentiable on $[s_0, s]$ and twice differentiable on $(s_0, s)$. Then

$$\phi(s) = \phi(s_0) + \phi'(s_0)(s - s_0) + \int_{s_0}^{s} (s - t)\,\phi''(t)\,dt. \tag{8}$$

The argument $s$ appears in the limits of the integral in the above theorem and consequently
can be awkward to work with. Also, it will be useful to expand $\phi$ about
some point not at the end of the interval of integration. The following corollary of Taylor's
theorem removes these problems by introducing piece-wise linear terms of the form
$(s - t)_+ = (s - t) \vee 0$.

Corollary 2 (Integral Representation I) Let $\phi : [a, b] \to \mathbb{R}$ be a twice differentiable
function. Then, for all $s, s_0 \in [a, b]$ we have

$$\phi(s) = \phi(s_0) + \phi'(s_0)(s - s_0) + \int_a^b \phi_{s_0}(s, t)\,\phi''(t)\,dt, \tag{9}$$

where

$$\phi_{s_0}(s, t) := \begin{cases} (s - t)_+, & s_0 < t \\ (t - s)_+, & s_0 \ge t \end{cases} \tag{10}$$

is piece-wise linear and convex in $s$ for each $s_0, t \in [a, b]$.

This result is a consequence of the way in which the terms $\phi_{s_0}(s, t)$ effectively restrict the limits
of integration to the interval $(s_0, s) \subseteq [a, b]$ or $(s, s_0) \subseteq [a, b]$ depending on whether $s_0 < s$
or $s_0 > s$, with appropriate reversal of the sign of $(s - t)$.
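A quick numerical check of (9) and (10) can make the role of the kernel concrete. The sketch below is illustrative only: the function names, the midpoint quadrature rule, and the test function $\phi(s) = s^3$ on $[0, 2]$ are assumptions, not taken from the paper. It evaluates the right-hand side of (9) for both orderings of $s$ and $s_0$.

```python
def pos(x):
    """Positive part (x)_+ = max(x, 0)."""
    return max(x, 0.0)


def kernel(s0, s, t):
    """Piece-wise linear term phi_{s0}(s, t) from (10)."""
    return pos(s - t) if s0 < t else pos(t - s)


def integral_rep_one(phi, dphi, d2phi, s, s0, a, b, n=20000):
    """Right-hand side of (9), with the integral over [a, b] done by the midpoint rule."""
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        t = a + (k + 0.5) * h
        total += kernel(s0, s, t) * d2phi(t)
    return phi(s0) + dphi(s0) * (s - s0) + total * h


def cube(s):
    return s ** 3


def cube_d1(s):
    return 3.0 * s ** 2


def cube_d2(s):
    return 6.0 * s


# Expand about s0 = 0.5 and evaluate at s = 1.5, then swap the roles.
forward = integral_rep_one(cube, cube_d1, cube_d2, 1.5, 0.5, 0.0, 2.0)
backward = integral_rep_one(cube, cube_d1, cube_d2, 0.5, 1.5, 0.0, 2.0)
```

Up to quadrature error, `forward` should reproduce `cube(1.5)` and `backward` should reproduce `cube(0.5)`, even though the integral always runs over the whole of $[a, b]$.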

Liese and Vajda (2006) proved a general version of the above theorem that holds 
for functions with discontinuous first derivatives. Since convex functions are necessarily 
continuous, they replace the first derivative $\phi'$ with a right-hand derivative $\phi'_+$ (which
is guaranteed to exist) and the second derivative $\phi''$ with the measure $d\phi'_+$. To make
the exposition simpler we will generally assume that the functions we study are suitably 
differentiable (but see the comment below on distributional derivatives). 

When $a = 0$ and $b = 1$ a second integral representation for the unit interval can be
derived from (9) that removes the term involving $\phi'$.

Corollary 3 (Integral Representation II) A twice differentiable function $\phi : [0, 1] \to \mathbb{R}$
can be expressed as

$$\phi(s) = \phi(0) + (\phi(1) - \phi(0))\,s - \int_0^1 \psi(s, t)\,\phi''(t)\,dt \tag{11}$$

where $\psi(s, t) = (1 - t)s \wedge (1 - s)t$ is piece-wise linear and concave in $s \in [0, 1]$ for each
$t \in [0, 1]$.






The result follows by integration by parts of $t\phi''(t)$. The proof can be found in
Appendix A.1. It is used in Section 5 below to obtain an integral representation of losses
for binary class probability estimation. This representation can be traced back to
Temple (1954) who notes that the kernel $\psi(s, t)$ is the Green's function for the differential
equation $\psi'' = 0$ with boundary conditions $\psi(a) = \psi(b) = 0$.
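The representation (11) can also be verified numerically. The following sketch is an illustration under assumptions of our own choosing: the names `psi` and `integral_rep_two`, the midpoint rule, and the test function $\phi = \exp$ on $[0, 1]$ are hypothetical. It also checks the boundary behaviour of the kernel on the unit interval.

```python
import math


def psi(s, t):
    """Kernel psi(s, t) = min((1 - t) * s, (1 - s) * t) from Corollary 3."""
    return min((1.0 - t) * s, (1.0 - s) * t)


def integral_rep_two(phi, d2phi, s, n=20000):
    """Right-hand side of (11), with the integral over [0, 1] done by the midpoint rule."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        total += psi(s, t) * d2phi(t)
    return phi(0.0) + (phi(1.0) - phi(0.0)) * s - total * h


# For phi = exp the second derivative is again exp.
reconstructed = integral_rep_two(math.exp, math.exp, 0.3)
target = math.exp(0.3)
```

Note that only the endpoint values $\phi(0)$ and $\phi(1)$ and the weights $\phi''(t)$ enter the computation; the first derivative is not needed.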

Both these integral representations state that the non-linear part of $\phi$ can be
expressed as a weighted integral of piece-wise linear terms $\phi_{s_0}$ or $\psi$. When we restrict our
attention to convex $\phi$ we are guaranteed the "weights" $\phi''(t)$ for each of these terms are
non-negative. Since the measures of risk, information and divergence we examine below
do not depend on the linear part of these expansions we are able to identify convex
functions with the weights $w(t) = \phi''(t)$ that define their non-linear part. The sets of
piece-wise linear functions $\{\phi_{s_0}(s, t)\}_{t \in [a, b]}$ and $\{\psi(s, t)\}_{t \in [0, 1]}$ can be thought of as families
of "primitive" convex functions from which others can be built through their weighted
combination. Representations like these are often called Choquet representations after
work by Choquet (1953) on the representation of compact convex spaces (Phelps, 2001).

Equation 11 is also valid when $\phi''$ only exists in a distributional sense (Antosik et al.,
1973; Friedlander, 1982). In fact all of the integral representation results in this paper
are valid in this sense. Being able to deal with distributions is essential in order to understand the
weight functions corresponding to the primitive f-divergences and loss functions.

2.4 Bregman Divergence 

Bregman divergences are a generalisation of the notion of distances between points. 
Given a differentiable 14 convex function $\phi : \mathcal{S} \to \mathbb{R}$ and two points $s_0, s \in \mathcal{S}$ the Bregman
divergence 15 of $s$ from $s_0$ is defined to be

$$B_\phi(s, s_0) := \phi(s) - \phi(s_0) - \langle s - s_0, \nabla\phi(s_0) \rangle, \tag{12}$$

where $\nabla\phi(s_0)$ is the gradient of $\phi$ at $s_0$. A concise summary of many of the properties
of Bregman divergences is given by Banerjee et al. (2005b, Appendix A). In particular,
Bregman divergences always satisfy $B_\phi(s, s_0) \ge 0$ and $B_\phi(s_0, s_0) = 0$ for all $s, s_0 \in \mathcal{S}$,
regardless of the choice of $\phi$. They are not always metrics, however, as they do not
always satisfy the triangle inequality and their symmetry depends on the choice of $\phi$.

When $\mathcal{S} = \mathbb{R}$ and $\phi$ is twice differentiable, comparing the definition of a Bregman
divergence in (12) to the integral representation in (8) reveals that Bregman divergences
between real numbers can be defined as the non-linear part of the Taylor expansion of
$\phi$. Rearranging (8) shows that for all $s, s_0 \in \mathbb{R}$

$$\int_{s_0}^{s} (s - t)\,\phi''(t)\,dt = \phi(s) - \phi(s_0) - (s - s_0)\,\phi'(s_0) = B_\phi(s, s_0) \tag{13}$$



14. Technically, $\phi$ need only be differentiable on the relative interior $\mathrm{ri}(\mathcal{S})$ of $\mathcal{S}$. We omit this requirement
for simplicity and because it is not relevant to this discussion.

15. Named in reference to Bregman (1967) although he was not the first to consider such an equation, 
at least in the one dimensional case (Brunk et al., 1957, p. 838). 
since $\nabla\phi = \phi'$ and the inner product is simply multiplication over the reals. This result
also holds for more general convex sets $\mathcal{S}$. Importantly, it intuitively shows why the
following holds.

Theorem 4 Let $\phi$ and $\psi$ both be real-valued, differentiable convex functions over the
convex set $\mathcal{S}$ such that $\phi(s) = \psi(s) + as + b$ for some $a, b \in \mathbb{R}$. Then, for all $s$ and $s_0$,
$B_\phi(s, s_0) = B_\psi(s, s_0)$.

A proof can be obtained directly by substituting and expanding $\psi$ in the definition of a
Bregman divergence.
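Both (13) and Theorem 4 are easy to check numerically in the scalar case. The sketch below is a hypothetical illustration: the function names, the midpoint quadrature, the generator $\phi(s) = s \log s - s$ and the affine coefficients are our own choices. For this generator the divergence works out to $s \log(s/s_0) - s + s_0$.

```python
import math


def bregman(phi, dphi, s, s0):
    """Scalar Bregman divergence B_phi(s, s0) from (12)."""
    return phi(s) - phi(s0) - (s - s0) * dphi(s0)


def taylor_remainder(d2phi, s, s0, n=20000):
    """Midpoint-rule approximation of the integral of (s - t) * phi''(t) from s0 to s, as in (13)."""
    h = (s - s0) / n
    total = 0.0
    for k in range(n):
        t = s0 + (k + 0.5) * h
        total += (s - t) * d2phi(t)
    return total * h


def phi(s):
    return s * math.log(s) - s


def dphi(s):
    return math.log(s)


def d2phi(s):
    return 1.0 / s


# An affinely shifted generator psi(s) = phi(s) + a*s + b, as in Theorem 4.
a, b = 3.0, -7.0


def psi(s):
    return phi(s) + a * s + b


def dpsi(s):
    return dphi(s) + a


div_phi = bregman(phi, dphi, 2.0, 1.0)
div_psi = bregman(psi, dpsi, 2.0, 1.0)
remainder = taylor_remainder(d2phi, 2.0, 1.0)
```

The affine shift changes both the function values and the derivative, but the two changes cancel in (12), so `div_phi` and `div_psi` coincide up to rounding, and both match the integral `remainder`.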

2.5 Jensen's Inequality and the Jensen Gap 

A central inequality in the study of convex functions is Jensen's inequality. It relates 
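the expectation of a convex function applied to a random variable to the convex function
evaluated at its mean. We will denote by $\mathbb{E}_\mu[\cdot] := \int_{\mathcal{S}} \cdot \, d\mu$ expectation over $\mathcal{S}$ with respect
to a probability measure $\mu$ over $\mathcal{S}$.

Theorem 5 (Jensen's Inequality) Let $\phi : \mathcal{S} \to \mathbb{R}$ be a convex function, $\mu$ be a
distribution and $\mathsf{S}$ be an $\mathcal{S}$-valued random variable (measurable w.r.t. $\mu$) such that
$\mathbb{E}_\mu[|\mathsf{S}|] < \infty$. The following inequality holds

$$\mathbb{J}_\mu[\phi(\mathsf{S})] := \mathbb{E}_\mu[\phi(\mathsf{S})] - \phi(\mathbb{E}_\mu[\mathsf{S}]) \ge 0. \tag{14}$$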

The proof is straight-forward and can be found in (Dudley, 2003, §10.2). Jensen's
inequality can also be used to characterise the class of convex functions. If $\phi$ is a function
such that (14) holds for all random variables and distributions then $\phi$ must be convex.$^{16}$
Intuitively, this connection between expectation and convexity is natural since expecta- 
tion can be seen as an operator that takes convex combinations of random variables. 

We will call the difference $\mathbb{J}_\mu[\phi(S)]$ the Jensen gap for $\phi(S)$. Many measures of
divergence and information studied in the subsequent sections can be expressed as the 
Jensen gap of some convex function. Due to the linearity of expectation, the Jensen gap 
is insensitive to the addition of affine terms to the convex function that defines it: 

Theorem 6 Let $\phi : \mathcal{S} \to \mathbb{R}$ be a convex function and $S$ and $\mu$ be as in Theorem 5. Then
for each $a, b \in \mathbb{R}$ the convex function $\psi(s) := \phi(s) + as + b$ satisfies $\mathbb{J}_\mu[\phi(S)] = \mathbb{J}_\mu[\psi(S)]$,
where $\phi_{s_0}$ is as in (10).

The proof is a consequence of the definition of the Jensen gap and the linearity of 
expectations and can be found in Appendix A.2. An implication of this theorem is that
when considering sets of convex functions as parameters to the Jensen gap operator they 
only need be identified by their non-linear part. Thus, the Jensen gap operator can be 
seen to impose an equivalence relation over convex functions where two convex functions 
are equivalent if they have the same Jensen gap, that is, if their difference is affine. 
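
As a small illustration of Theorem 6, the sketch below computes the Jensen gap of a discrete random variable for a convex function and for an affine perturbation of it. The function $\phi(s) = s \ln s$, the support points and the probabilities are assumptions chosen only for the example, and `jensen_gap` is a hypothetical helper name.

```python
import math

def jensen_gap(phi, values, probs):
    """Jensen gap E[phi(S)] - phi(E[S]) for a discrete random variable S."""
    mean = sum(p * s for s, p in zip(values, probs))
    return sum(p * phi(s) for s, p in zip(values, probs)) - phi(mean)

# Illustrative discrete distribution on three points of (0, 1).
values = [0.1, 0.5, 0.9]
probs = [0.2, 0.5, 0.3]

phi = lambda s: s * math.log(s)        # convex on (0, infinity)
psi = lambda s: phi(s) + 4.0 * s - 2.0 # phi plus an affine term

gap_phi = jensen_gap(phi, values, probs)
gap_psi = jensen_gap(psi, values, probs)
print(gap_phi, gap_psi)
```

Both gaps are non-negative and coincide up to floating-point rounding, since the affine term contributes equally to both parts of the difference.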

16. This can be seen by considering a distribution with a finite, discrete set of points as its support and 
applying Theorem 4.3 of Rockafellar (1970). 
In light of the two integral representations in Section 2.3, this means the Jensen gap
only depends on the integral terms in (9) and (11) and so is completely characterised
by the weights provided by $\phi''$. Specifically, for suitably differentiable $\phi : [a, b] \to \mathbb{R}$ we
have



Since several of the measures of divergence, information and risk we analyse can be 
expressed as a Jensen gap, this observation implies that these quantities can be identified 
with the weights provided by $\phi''$ as it is these that completely determine the measure's
behaviour. 

3. Binary Experiments and Measures of Divergence 

The various properties of convex functions developed in the previous section have many 
implications for the study of statistical inference. We begin by considering binary ex- 
periments (P, Q) where P and Q are probability measures 17 over a common space X. 
We will often consider P the distribution over positive instances and Q the distribution 
over negative instances. The densities of P and Q with respect to some third reference 
distribution M over X will be defined by dP = p dM and dQ = q dM respectively. Un- 
less stated otherwise we will assume that P and Q are both absolutely continuous with 
respect to M. (One can always choose M to ensure this by setting M = (P + Q) /2; but 
see the next section.) 

There are several ways in which the "separation" of P and Q in a binary experiment 
can be quantified. Intuitively, these all measure the difficulty of distinguishing between 
the two distributions on the basis of instances drawn from their mixture. The further 
apart the distributions are the easier discrimination becomes. This intuition is made
precise through the connections with risk and MMD later in Appendix F. 

A central statistic in the study of binary experiments and statistical hypothesis test- 
ing is the likelihood ratio dP/dQ. As the following section outlines, the likelihood ratio 
is, in the sense of preserving the distinction between P and Q, the "best" mapping from 
an arbitrary space X to the real line. 

3.1 Statistical Tests and the Neyman-Pearson Lemma

In the context of a binary experiment $(P, Q)$, a statistical test is any function that
assigns each instance $x \in \mathcal{X}$ to either $P$ or $Q$. We will use the labels 1 and 0 for $P$ and $Q$
respectively and so a statistical test is any function $r : \mathcal{X} \to \{0, 1\}$. In machine learning,
a function of this type is usually referred to as a classifier. The link between tests and 
classifiers is explored further in Section 4. 



17. We intentionally avoid too many measure theoretic details for the sake of clarity. Appropriate
$\sigma$-algebras and continuity can be assumed where necessary.
Each test $r$ partitions the instance space $\mathcal{X}$ into positive and negative prediction sets:

\[ \mathcal{X}_r^+ := \{x \in \mathcal{X} : r(x) = 1\}, \qquad \mathcal{X}_r^- := \{x \in \mathcal{X} : r(x) = 0\}. \]

There are four classification rates associated with these prediction sets: the true positive
rate (TP), true negative rate (TN), false positive rate (FP) and the false negative rate
(FN). For a given test $r$ they are defined as follows:

\[ \begin{aligned} \mathrm{TP}_r &:= P(\mathcal{X}_r^+) & \mathrm{FP}_r &:= Q(\mathcal{X}_r^+) \\ \mathrm{FN}_r &:= P(\mathcal{X}_r^-) & \mathrm{TN}_r &:= Q(\mathcal{X}_r^-) \end{aligned} \tag{15} \]

The subscript $r$ will often be dropped when the test is clear from the context. Since $P$
and $Q$ are distributions over $\mathcal{X} = \mathcal{X}^+ \cup \mathcal{X}^-$ and the positive and negative sets are disjoint
we have that TP + FN = 1 and FP + TN = 1. As a consequence, the four values in
(15) can be summarised by choosing one from each column.

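Often, statistical tests are obtained by applying a threshold $\tau_0$ to a real-valued test
statistic $\tau : \mathcal{X} \to \mathbb{R}$. In this case, the statistical test is $r(x) = \llbracket \tau(x) \ge \tau_0 \rrbracket$. This leads
to parametrised forms of prediction sets $\mathcal{X}_\tau^y(\tau_0) := \mathcal{X}^y_{\llbracket \tau \ge \tau_0 \rrbracket}$, for $y \in \{+, -\}$, and the
classification rates $\mathrm{TP}_\tau(\tau_0)$, $\mathrm{FP}_\tau(\tau_0)$, $\mathrm{TN}_\tau(\tau_0)$, and $\mathrm{FN}_\tau(\tau_0)$ which are defined analogously.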
By varying the threshold parameter a range of classification rates can be achieved. This 
observation leads to a well known graphical representation of test statistics known as 
the ROC curve, which is discussed further in Section 6.1. 
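
The four classification rates of a thresholded test statistic are easy to compute on a finite instance space. The sketch below assumes an illustrative four-point space with hand-picked distributions `P` and `Q` and statistic `tau`; the function name `rates` is hypothetical.

```python
def rates(P, Q, tau, tau0):
    """Classification rates of the test r(x) = [tau(x) >= tau0] on a finite space.

    P and Q are lists of point probabilities, tau is the list of statistic values.
    """
    positive = [i for i, t in enumerate(tau) if t >= tau0]
    negative = [i for i, t in enumerate(tau) if t < tau0]
    return {
        "TP": sum(P[i] for i in positive),
        "FP": sum(Q[i] for i in positive),
        "FN": sum(P[i] for i in negative),
        "TN": sum(Q[i] for i in negative),
    }

# Illustrative binary experiment on four points.
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.4, 0.3, 0.2, 0.1]
tau = [0.0, 1.0, 2.0, 3.0]

r = rates(P, Q, tau, 2.0)
# Sweeping the threshold traces out (FP, TP) pairs, i.e. points on the ROC curve.
roc = [(rates(P, Q, tau, t0)["FP"], rates(P, Q, tau, t0)["TP"])
       for t0 in [4.0, 3.0, 2.0, 1.0, 0.0]]
print(r, roc)
```

Lowering the threshold moves the (FP, TP) pair from (0, 0) towards (1, 1), which is the parametrisation behind the ROC curve mentioned above.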

A natural question is whether there is a "best" statistical test or test statistic to
use for binary experiments. This is usually formulated in terms of a test's power and
size. The power $\beta_r$ of the test $r$ for a particular binary experiment $(P, Q)$ is a synonym
for its true positive rate (that is, $\beta_r := \mathrm{TP}_r$ and so $1 - \beta_r = \mathrm{FN}_r$)$^{18}$ and the size $\alpha_r$
of the same test is just its false positive rate $\alpha_r := \mathrm{FP}_r$. Here, "best" is considered to be
the uniformly most powerful (UMP) test of a given size. That is, a test $r$ is considered
UMP of size $\alpha \in [0, 1]$ if $\alpha_r = \alpha$ and for all other tests $r'$ such that $\alpha_{r'} \le \alpha$ we have
$1 - \beta_r \le 1 - \beta_{r'}$. We will denote by $\beta(\alpha) := \beta(\alpha, P, Q)$ the true positive rate of a UMP
test between $P$ (the null hypothesis) and $Q$ (the alternative) at $Q$ with significance $\alpha$.
Torgersen (1991) calls $\beta(\cdot, P, Q)$ the Neyman-Pearson function for the dichotomy $(P, Q)$.
Formally, for each $\alpha \in [0, 1]$, the Neyman-Pearson function $\beta$ measures the largest true
positive rate $\mathrm{TP}_r$ of any measurable classifier $r : \mathcal{X} \to \{-1, 1\}$ that has false positive rate
$\mathrm{FP}_r$ at most $\alpha$. That is,

\[ \beta(\alpha) = \beta(\alpha, P, Q) := \sup_r \{\mathrm{TP}_r : \mathrm{FP}_r \le \alpha\}. \]

The Neyman-Pearson lemma (Neyman and Pearson, 1933) shows that the likelihood
ratio $\tau^*(x) = dP/dQ(x)$ is the uniformly most powerful test for each choice of threshold

18. This is opposite to the usual definition of $\beta_r$ in the statistical literature. Usually, $1 - \beta_r$ is used to
denote the power of a test. We have chosen to use $\beta_r$ for the power (true positive rate) as this makes
it easier to compare with ROC curves.
$\tau_0$. Since each choice of $\tau_0 \in \mathbb{R}$ results in a test $\llbracket dP/dQ \ge \tau_0 \rrbracket$ of some size $\alpha \in [0, 1]$ we
have that$^{19}$

\[ \beta(\mathrm{FP}_{\tau^*}(\tau_0)) = \mathrm{TP}_{\tau^*}(\tau_0) \tag{16} \]

and so varying $\tau_0$ over $\mathbb{R}$ results in a maximal ROC curve. This too is discussed further
in Section 6.1.

The Neyman-Pearson lemma thus identifies the likelihood ratio dP/dQ as a particu- 
larly useful statistic. Given an experiment (P, Q) it is, in some sense, the best mapping 
from the space X to the reals. The next section shows how this statistic can be used as 
the basis for a variety of divergence measures between $P$ and $Q$.
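
On a finite instance space the Neyman-Pearson function can be computed by brute force, which gives a direct check of (16). The sketch below is only an illustration under assumed distributions `P` and `Q`; it enumerates every deterministic test, and the names `tp_fp`, `neyman_pearson` and `lr_test` are hypothetical. Exact rational arithmetic is used so that the comparison with the size $\alpha$ is not affected by rounding.

```python
from fractions import Fraction as F
from itertools import combinations

# Illustrative binary experiment on four points (exact rationals).
P = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]
Q = [F(4, 10), F(3, 10), F(2, 10), F(1, 10)]
points = range(len(P))

def tp_fp(A):
    """True and false positive rates of the test that predicts 1 exactly on the set A."""
    return sum(P[i] for i in A), sum(Q[i] for i in A)

def neyman_pearson(alpha):
    """Largest TP over all deterministic tests with FP <= alpha (brute force)."""
    best = F(0)
    for k in range(len(P) + 1):
        for A in combinations(points, k):
            tp, fp = tp_fp(A)
            if fp <= alpha:
                best = max(best, tp)
    return best

def lr_test(tau0):
    """Positive prediction set of the thresholded likelihood ratio [dP/dQ >= tau0]."""
    return [i for i in points if P[i] / Q[i] >= tau0]

# Check (16) at every threshold that changes the likelihood ratio test.
thresholds = sorted({P[i] / Q[i] for i in points})
checks = []
for tau0 in thresholds:
    tp, fp = tp_fp(lr_test(tau0))
    checks.append(neyman_pearson(fp) == tp)
print(checks)
```

Every entry of `checks` should be `True`: no deterministic test of the same size has a larger true positive rate than the likelihood ratio test.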



3.2 Csiszár f-divergences

The class of f-divergences (Ali and Silvey, 1966; Csiszár, 1967) provide a rich set of
relations that can be used to measure the separation of the distributions in a binary
experiment. An f-divergence is a function that measures the "distance" between a pair
of distributions $P$ and $Q$ defined over a space $\mathcal{X}$ of observations. Traditionally, the
f-divergence of $P$ from $Q$ is defined for any convex $f : (0, \infty) \to \mathbb{R}$ such that $f(1) = 0$. In
this case, the f-divergence is

\[ \mathbb{I}_f(P, Q) = \mathbb{E}_Q\left[ f\left( \frac{dP}{dQ} \right) \right] = \int_{\mathcal{X}} f\left( \frac{dP}{dQ} \right) dQ \tag{17} \]

when $P$ is absolutely continuous with respect to $Q$ and equals $\infty$ otherwise.$^{20}$

The above definition is not completely well-defined as the behaviour of $f$ is not
specified at the endpoints of $(0, \infty)$. This is remedied via the perspective transform of $f$,
introduced in Section 2.1 above which defines the limiting behaviour of $f$. Given convex
$f : (0, \infty) \to \mathbb{R}$ such that $f(1) = 0$ the f-divergence of $P$ from $Q$ is

\[ \mathbb{I}_f(P, Q) := \mathbb{E}_M[I_f(p, q)] = \mathbb{E}_{X \sim M}[I_f(p(X), q(X))], \tag{18} \]

where $I_f$ is the perspective transform of $f$.

The restriction that $f(1) = 0$ in the above definition is only present to normalise $\mathbb{I}_f$ so
that $\mathbb{I}_f(Q, Q) = 0$ for all distributions $Q$. We can extend the definition of f-divergences
to all convex $f$ by performing the normalisation explicitly. Since $f(\mathbb{E}_Q[dP/dQ]) = f(1)$
this is done most conveniently through the definition of the Jensen gap for the function
$f$ applied to the random variable $dP/dQ$ with distribution $Q$. That is, for all convex
$f : (0, \infty) \to \mathbb{R}$ and for all distributions $P$ and $Q$

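\[ \mathbb{I}_f(P, Q) := \mathbb{J}_Q\left[ f\left( \frac{dP}{dQ} \right) \right] = \mathbb{E}_Q\left[ f\left( \frac{dP}{dQ} \right) \right] - f(1). \tag{19} \]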



Due to the issues surrounding the behaviour of $f$ at 0 and $\infty$ the definitions in (17),
(18) and (19) are not entirely equivalent. When it is necessary to deal with the limiting

19. Equation (69) in Section 6.3 below shows that $\beta(\alpha)$ is the lower envelope of a family of linear
functions of $\alpha$ and is thus concave and continuous. Hence, the equality in (16) holds.

20. Liese and Miescke (2008, pg. 34) give a definition that does not require absolute continuity.
behaviour, the definition in (18) will be used. However, the version in (19) will be
most useful when drawing connections between f-divergences and various definitions of
information in Section 4 below.

Several properties of f-divergence can be immediately obtained from the above
definitions. The symmetry of the perspective $I_f$ in (3) means that

\[ \mathbb{I}_f(P, Q) = \mathbb{I}_{f^\diamond}(Q, P) \tag{20} \]

for all distributions $P$ and $Q$, where $f^\diamond$ is the Csiszár dual of $f$. The non-negativity of the
Jensen gap ensures that $\mathbb{I}_f(P, Q) \ge 0$ for all $P$ and $Q$. Furthermore, the affine invariance
of the Jensen gap (Theorem 6) implies the same affine invariance for f-divergences.

Several well-known divergences correspond to specific choices of the function $f$ (Ali
and Silvey, 1966, §5). One divergence central to this paper is the variational divergence
$V(P, Q)$ which is obtained by setting $f(t) = |t - 1|$ in Equation 18. It is the only
f-divergence that is a true metric on the space of distributions over $\mathcal{X}$ (Khosravifard et al.,
2007) and gets its name from its equivalent definition in the variational form

\[ V(P, Q) = 2\|P - Q\|_\infty := 2 \sup_{A \subseteq \mathcal{X}} |P(A) - Q(A)|. \tag{21} \]

(Some authors define $V$ without the 2 above.) This form of the variational divergence
is discussed further in Section 8. Furthermore, the variational divergence is one of
a family of "primitive" f-divergences discussed in Section 5. These are primitive in the
sense that all other f-divergences can be expressed as a weighted sum of members from
this family.

Another well known f-divergence is the Kullback-Leibler (KL) divergence $\mathrm{KL}(P, Q)$,
obtained by setting $f(t) = t \ln(t)$ in Equation 18. Others are given in Table 1 in
Section 5.4.
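
For distributions on a finite space with $Q(x) > 0$ everywhere, the f-divergence reduces to a finite sum and the two named examples can be checked directly. This is a minimal sketch with assumed distributions `P` and `Q`; `f_divergence` and `sup_event_gap` are hypothetical helper names, and the supremum in (21) is found by enumerating all events.

```python
import math
from itertools import combinations

def f_divergence(f, P, Q):
    """I_f(P, Q) = E_Q[f(dP/dQ)] for discrete P, Q with Q(x) > 0 at every point."""
    return sum(q * f(p / q) for p, q in zip(P, Q))

def sup_event_gap(P, Q):
    """sup over events A of |P(A) - Q(A)|, by enumerating all subsets."""
    idx = range(len(P))
    return max(abs(sum(P[i] - Q[i] for i in A))
               for k in range(len(P) + 1)
               for A in combinations(idx, k))

# Illustrative binary experiment on four points.
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.4, 0.3, 0.2, 0.1]

V = f_divergence(lambda t: abs(t - 1), P, Q)        # variational divergence
KL = f_divergence(lambda t: t * math.log(t), P, Q)  # Kullback-Leibler divergence
KL_direct = sum(p * math.log(p / q) for p, q in zip(P, Q))
print(V, 2 * sup_event_gap(P, Q), KL, KL_direct)
```

The first two printed values illustrate (21), and the last two confirm that $f(t) = t \ln t$ recovers the familiar sum $\sum_x P(x) \ln (P(x)/Q(x))$.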

3.3 Generative Bregman Divergences 

Another measure of the separation of distributions can be defined as the expected Bregman
divergence between the densities $p$ and $q$ with respect to the reference measure $M$.
Given a convex function $\phi : \mathbb{R}^+ \to \mathbb{R}$ the generative Bregman divergence between the
distributions $P$ and $Q$ is (confer (18))

\[ \mathbb{B}_\phi(P, Q) := \mathbb{E}_M[B_\phi(p, q)] = \mathbb{E}_{X \sim M}[B_\phi(p(X), q(X))]. \tag{22} \]

We call this Bregman divergence "generative" to distinguish it from the "discriminative"
Bregman divergence introduced in Section 4 below, where the adjectives "generative"
and "discriminative" are explained further.

Csiszár (1995) notes that there is only one divergence common to the class of
f-divergences and the generative Bregman divergences. In this sense, these two classes
of divergences are "orthogonal" to each other. Their only common point is when the
respective convex functions satisfy $f(t) = \phi(t) = t \ln t - at + b$ (for $a, b \in \mathbb{R}$) in which
case both $\mathbb{I}_f$ and $\mathbb{B}_\phi$ are the KL divergence.
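
The common point of the two classes can be verified numerically for $\phi(t) = t \ln t$. The sketch below assumes a finite space, two illustrative reference measures (`M_mix` and `M_unif`) and hypothetical helper names; it evaluates (22) for each reference measure and compares the result with the KL divergence.

```python
import math

def bregman(phi, dphi, s, s0):
    """Scalar Bregman divergence B_phi(s, s0)."""
    return phi(s) - phi(s0) - (s - s0) * dphi(s0)

def generative_bregman(phi, dphi, P, Q, M):
    """E_M[B_phi(p, q)] with densities p = dP/dM and q = dQ/dM on a finite space."""
    return sum(m * bregman(phi, dphi, p / m, q / m) for p, q, m in zip(P, Q, M))

phi = lambda t: t * math.log(t)
dphi = lambda t: math.log(t) + 1.0

# Illustrative binary experiment and two reference measures dominating P and Q.
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.4, 0.3, 0.2, 0.1]
M_mix = [(p + q) / 2 for p, q in zip(P, Q)]
M_unif = [0.25] * 4

kl = sum(p * math.log(p / q) for p, q in zip(P, Q))
b_mix = generative_bregman(phi, dphi, P, Q, M_mix)
b_unif = generative_bregman(phi, dphi, P, Q, M_unif)
print(kl, b_mix, b_unif)
```

For this particular $\phi$ the value does not depend on the reference measure, because the pointwise term simplifies to $p \ln(p/q) - p + q$ after multiplying by $m$.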
4. Risk and Statistical Information

The above discussion of f-divergences assumes an arbitrary reference measure $M$ over
the space $\mathcal{X}$ to define the densities $p$ and $q$. In the previous section, the choice of reference
measure was irrelevant since f-divergences are invariant to this choice.

In this section an assumption is made that adds additional structure to the
relationship between $P$ and $Q$. Specifically, we assume that the reference measure $M$ is a
mixture of these two distributions. That is, $M = \pi P + (1 - \pi) Q$ for some $\pi \in (0, 1)$.
In this case, by construction, $P$ and $Q$ are absolutely continuous with respect to $M$.
Intuitively, this can be seen as defining a distribution over the observation space $\mathcal{X}$ by
first tossing a coin with a bias $\pi$ for heads and drawing observations from $P$ on heads
or $Q$ on tails.

This extra assumption allows us to interpret a binary experiment $(P, Q)$ as a
generalised supervised binary task $(\pi, P, Q)$ where the positive ($y = 1$) and negative ($y = -1$)
labels $y \in \mathcal{Y} := \{-1, 1\}$ are paired with observations $x \in \mathcal{X}$ through a joint distribution
$\mathbb{P}$ over $\mathcal{X} \times \mathcal{Y}$. (We formally define a task later in terms of an experiment plus loss
function.) Given an observation drawn from $\mathcal{X}$ according to $M$, it is natural to try to
predict its corresponding label or estimate the probability it was drawn from $P$.

Below we will introduce risk, regret, and proper scoring rules and show how these 
relate to discriminative Bregman divergence. We then show the connection between the 
generative view (f-divergence between the class conditional distributions) and Bregman
divergence. 

4.1 Generative and Discriminative Views 

Traditionally, the joint distribution $\mathbb{P}$ of inputs $x \in \mathcal{X}$ and labels $y \in \mathcal{Y}$ is used as the
starting point for analysing risk in statistical learning theory. To better link risks to
divergences, our analysis will consider two related representations of $\mathbb{P}$.

The generative view decomposes $\mathbb{P}$ into two class-conditional distributions defined as
$P(X) := \mathbb{P}(X \mid y = 1)$, $Q(X) := \mathbb{P}(X \mid y = -1)$ for all $X \subseteq \mathcal{X}$ and a mixing probability
or prior $\pi := \mathbb{P}(\mathcal{X}, y = 1)$. The discriminative representation decomposes the joint
distribution into an observation distribution $M(X) := \mathbb{P}(X, \mathcal{Y})$ for all $X \subseteq \mathcal{X}$ and an
observation-conditional density or posterior $\eta(x) = \frac{dH}{dM}(x)$ where $H(X) := \mathbb{P}(X, y = 1)$.
The terms "generative" and "discriminative" are used here to suggest a distinction made
by Ng and Jordan (2002): in the generative case, the aim is to model the class-conditional
distributions $P$ and $Q$ and then use Bayes rule to compute the most likely class; in
the discriminative case the focus is on estimating $\eta(x)$ directly. Although we are not
interested in this paper in the problems of modelling or estimating we find the distinction
a useful one.$^{21}$

21. The generative-discriminative distinction usually refers to whether one is modelling the process that 
generates each class-conditional distribution, or instead wishes solely to perform well on a discrimina- 
tion task (Drummond, 2006; Lasserre et al., 2006; Minka, 2005; Rubinstein and Hastie, 1997). There 
has been some recent work relating the two in the sense that if the class conditional distributions are
well estimated then one will perform well in discrimination (Long and Servedio, 2006; Long et al.,
2006; Goldberg, 2001; Palmer and Goldberg, 2006). 
Figure 1: The generative and discriminative view of binary experiments.



Both these decompositions are exact since $\mathbb{P}$ can be reconstructed from either. Also,
translating between them is straight-forward, since

\[ M = \pi P + (1 - \pi) Q \quad \text{and} \quad \eta = \pi \frac{dP}{dM}, \]

so we will often swap between $(\eta, M)$ and $(\pi, P, Q)$ as arguments to functions for risk,
divergence and information. A graphical representation of the generative and
discriminative views of a binary task is shown in Figure 1.

The posterior $\eta$ is closely related to the likelihood ratio $dP/dQ$ in the supervised
binary task setting. For each choice of $\pi \in (0, 1)$ this relationship can be expressed as a
mapping $\lambda_\pi : [0, 1] \to [0, \infty]$ and its inverse $\lambda_\pi^{-1}$ defined by

\[ \lambda_\pi(c) := \frac{1 - \pi}{\pi} \frac{c}{1 - c} \tag{23} \]

\[ \lambda_\pi^{-1}(t) := \frac{\pi t}{\pi t + 1 - \pi} \tag{24} \]

for all $c \in [0, 1)$ and $t \in [0, \infty)$ and $\lambda_\pi(1) := \infty$. Thus

\[ \eta = \lambda_\pi^{-1}\left( \frac{dP}{dQ} \right). \]

These will be used later when connecting f-divergences and risk.
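
The translation between the generative and discriminative views can be made concrete on a finite space. The sketch below assumes an illustrative prior `pi = 0.3` and distributions `P` and `Q`, and uses hypothetical names `lam` and `lam_inv` for the maps in (23) and (24).

```python
import math

def lam(pi, c):
    """lambda_pi(c): posterior probability -> likelihood ratio, as in (23)."""
    return math.inf if c == 1 else (1 - pi) / pi * c / (1 - c)

def lam_inv(pi, t):
    """lambda_pi^{-1}(t): likelihood ratio -> posterior probability, as in (24)."""
    return pi * t / (pi * t + 1 - pi)

# Illustrative generative description (pi, P, Q) on four points.
pi = 0.3
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.4, 0.3, 0.2, 0.1]

# Discriminative description (eta, M).
M = [pi * p + (1 - pi) * q for p, q in zip(P, Q)]
eta = [pi * p / m for p, m in zip(P, M)]
ratio = [p / q for p, q in zip(P, Q)]

print([lam_inv(pi, t) for t in ratio], eta)
```

The posterior computed from $\eta = \pi\, dP/dM$ matches $\lambda_\pi^{-1}$ applied to the likelihood ratio, and applying `lam` to `eta` recovers the ratios.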



4.2 Estimators, Classifiers and Risk 

We will call a ($M$-measurable) function $\hat\eta : \mathcal{X} \to [0, 1]$ a class probability estimator.
Overloading the notation slightly, we will also use $\hat\eta = \hat\eta(x) \in [0, 1]$ to denote an
estimate for a specific observation $x \in \mathcal{X}$. Many of the subsequent arguments rely on this
conditional perspective.

Estimate quality is assessed using a loss function $\ell : \mathcal{Y} \times [0, 1] \to \mathbb{R}$ and the loss of
the estimate $\hat\eta$ with respect to the label $y \in \mathcal{Y}$ is denoted $\ell(y, \hat\eta)$. If $\eta \in [0, 1]$ is the
probability of observing the label $y = 1$ the point-wise risk of the estimate $\hat\eta \in [0, 1]$ is
defined to be the $\eta$-average of the point-wise loss for $\hat\eta$:

\[ L(\eta, \hat\eta) := \mathbb{E}_{Y \sim \eta}[\ell(Y, \hat\eta)] = \ell(0, \hat\eta)(1 - \eta) + \ell(1, \hat\eta)\,\eta. \tag{25} \]

(This is what Steinwart (2006) calls the inner risk.) When $\eta : \mathcal{X} \to [0, 1]$ is an
observation-conditional density, taking the $M$-average of the point-wise risk gives the
(full) risk of the estimator $\hat\eta$:

\[ \mathbb{L}(\eta, \hat\eta, M) := \mathbb{E}_M[L(\eta, \hat\eta)] = \mathbb{E}_{X \sim M}[L(\eta(X), \hat\eta(X))] = \int_{\mathcal{X}} L(\eta(x), \hat\eta(x))\, dM(x) =: \mathbb{L}(\pi, \hat\eta, P, Q). \]

The convention of using $\ell$, $L$ and $\mathbb{L}$ for the loss, point-wise and full risk is used throughout
this paper.

We call the combination of a loss $\ell$ and the distribution $\mathbb{P}$ a task and denote it
discriminatively as $T = (\eta, M; \ell)$ or generatively as $T = (\pi, P, Q; \ell)$. A natural measure
of the difficulty of a task is its minimal achievable risk, or Bayes risk:

\[ \underline{\mathbb{L}}(\eta, M) = \underline{\mathbb{L}}(\pi, P, Q) := \inf_{\hat\eta \in [0, 1]^{\mathcal{X}}} \mathbb{L}(\eta, \hat\eta, M) = \mathbb{E}_{X \sim M}[\underline{L}(\eta(X))], \]

where

\[ [0, 1] \ni \eta \mapsto \underline{L}(\eta) := \inf_{\hat\eta \in [0, 1]} L(\eta, \hat\eta) \]

is the point-wise Bayes risk. Note the use of the underline on $\underline{L}$ and $\underline{\mathbb{L}}$ to indicate that
the corresponding functions $L$ and $\mathbb{L}$ are minimised.

4.2.1 Proper Scoring Rules 

If $\hat\eta$ is to be interpreted as an estimate of the true positive class probability $\eta$ then it
is desirable to require that $L(\eta, \hat\eta)$ be minimised by $\hat\eta = \eta$ for all $\eta \in [0, 1]$. Losses that
satisfy this constraint are said to be Fisher consistent and are known as proper scoring
rules (Buja et al., 2005; Gneiting and Raftery, 2007). That is, a proper scoring rule $\ell$
satisfies $\underline{L}(\eta) = L(\eta, \eta)$ for all $\eta \in [0, 1]$.
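
Fisher consistency can be illustrated with a crude grid search over estimates. The sketch below assumes the usual forms of log loss and square loss with labels coded as 0 and 1 (matching the arguments of $\ell$ in (25)); the function names and the grid resolution are choices made only for this example.

```python
import math

def pointwise_risk(loss, eta, eta_hat):
    """L(eta, eta_hat) = (1 - eta) * loss(0, eta_hat) + eta * loss(1, eta_hat), as in (25)."""
    return (1 - eta) * loss(0, eta_hat) + eta * loss(1, eta_hat)

def log_loss(y, eta_hat):
    return -math.log(eta_hat) if y == 1 else -math.log(1 - eta_hat)

def square_loss(y, eta_hat):
    return (y - eta_hat) ** 2

def grid_minimiser(loss, eta, n=1000):
    """Estimate on the grid {1/n, ..., (n-1)/n} with the smallest point-wise risk."""
    grid = [i / n for i in range(1, n)]
    return min(grid, key=lambda e: pointwise_risk(loss, eta, e))

for eta in (0.1, 0.3, 0.75):
    print(eta, grid_minimiser(log_loss, eta), grid_minimiser(square_loss, eta))
```

For both losses the minimising estimate coincides with the true probability, so the point-wise Bayes risk is $L(\eta, \eta)$; for square loss this is $\eta(1 - \eta)$.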

Proper scoring rules for probability estimation and surrogate margin losses (confer 
Bartlett et al. (2006)) for classification are closely related. (Surrogate margin losses are 
considered in more detail in Appendix C.) Buja et al. (2005) note that "the surrogate 
criteria of classification are exactly the primary criteria of class probability estimation" 
and that most commonly used surrogate margin losses are just proper scores mapped 
from $[0, 1]$ to $\mathbb{R}$ via a link function. The main exceptions are hinge losses$^{22}$ which means
SVMs are "the only case that truly bypasses estimation of class probabilities and directly
aims at classification" (Buja et al., 2005, pg. 4). However, commonly used margin losses
of the form $\phi(yF(x))$ are a more restrictive class than proper scoring rules since, as Buja
et al. (2005, §23) note, "[t]his dependence on the margin limits all theory and practice
to a symmetric treatment of class 0 and class 1".

The following important property of proper scoring rules is originally attributed to 
Savage (1971). 

22. And powers of absolute divergence $|y - r|^\alpha$ for $\alpha \ne 2$.

Theorem 7 The point-wise Bayes risk $\underline{L}(\eta)$ for a proper scoring rule $\ell$ is a concave
function. Conversely, given a concave function $\Lambda : [0, 1] \to \mathbb{R}$ there exists a proper scoring
rule $\ell$ so that $\underline{L}(\eta) = \Lambda(\eta)$ and

\[ L(\eta, \hat\eta) = \underline{L}(\hat\eta) - (\hat\eta - \eta)\,\underline{L}'(\hat\eta). \tag{26} \]

Buja et al. (2005, §17) provide a proof which relies on Fisher consistency and the
linearity of $L(\eta, \hat\eta)$ in $\eta$ which means the functions $\eta \mapsto L(\eta, \hat\eta)$ are upper tangents to
$\underline{L}(\hat\eta)$ for all $\hat\eta \in [0, 1]$. We provide an alternate proof below immediately after the proof
of Theorem 19 which provides a general explicit formula for $L(\eta, \hat\eta)$.

This characterisation of the concavity of $\underline{L}$ means proper scoring rules have a natural
connection to Bregman divergences.

4.3 Discriminative Bregman Divergence 

Recall from Section 2.4 that if $\mathcal{S} \subseteq \mathbb{R}^d$ is a convex set, then a convex function $\phi : \mathcal{S} \to \mathbb{R}$
defines a Bregman divergence

\[ B_\phi(s, s_0) := \phi(s) - \phi(s_0) - \langle s - s_0, \nabla\phi(s_0) \rangle. \]

When $\mathcal{S} = [0, 1]$, the concavity of $\underline{L}$ means $\phi(s) = -\underline{L}(s)$ is convex and so induces the
Bregman divergence$^{23}$

\[ B_\phi(s, s_0) = -\underline{L}(s) + \underline{L}(s_0) - (s_0 - s)\,\underline{L}'(s_0) = L(s, s_0) - \underline{L}(s) \]

by Theorem 7. The converse also holds. Given a Bregman divergence $B_\phi$ over $\mathcal{S} = [0, 1]$
the convexity of $\phi$ guarantees that $\underline{L} = -\phi$ is concave. Thus, we know that there is a
proper scoring rule $\ell$ with Bayes risk equal to $-\phi$. As noted by Buja et al. (2005, §19),
the difference

\[ B_\phi(\eta, \hat\eta) = L(\eta, \hat\eta) - \underline{L}(\eta) \]

is also known as the point-wise regret of the estimate $\hat\eta$ w.r.t. $\eta$. The corresponding
(full) regret is the $M$-average point-wise regret

\[ \mathbb{E}_{X \sim M}[B_\phi(\eta(X), \hat\eta(X))] = \mathbb{L}(\eta, \hat\eta) - \underline{\mathbb{L}}(\eta). \]
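
The identification of point-wise regret with a Bregman divergence can be checked for log loss, whose point-wise Bayes risk is the binary entropy. The sketch below is an illustration under that assumption; the function names are hypothetical, and the comparison with the binary KL divergence is a well-known special case rather than something stated in the text above.

```python
import math

def bayes_risk(eta):
    """Point-wise Bayes risk of log loss: the binary entropy."""
    return -eta * math.log(eta) - (1 - eta) * math.log(1 - eta)

def bayes_risk_prime(eta):
    """Derivative of the binary entropy with respect to eta."""
    return math.log((1 - eta) / eta)

def pointwise_risk(eta, eta_hat):
    """Point-wise risk of log loss when the true probability is eta."""
    return -eta * math.log(eta_hat) - (1 - eta) * math.log(1 - eta_hat)

def bregman_neg_bayes(eta, eta_hat):
    """B_phi(eta, eta_hat) for phi = -bayes_risk."""
    phi = lambda s: -bayes_risk(s)
    dphi = lambda s: -bayes_risk_prime(s)
    return phi(eta) - phi(eta_hat) - (eta - eta_hat) * dphi(eta_hat)

eta, eta_hat = 0.3, 0.6
regret = pointwise_risk(eta, eta_hat) - bayes_risk(eta)
savage = bayes_risk(eta_hat) - (eta_hat - eta) * bayes_risk_prime(eta_hat)  # right side of (26)
binary_kl = eta * math.log(eta / eta_hat) + (1 - eta) * math.log((1 - eta) / (1 - eta_hat))
print(regret, bregman_neg_bayes(eta, eta_hat), binary_kl)
```

The regret, the Bregman divergence generated by the negative Bayes risk, and the binary KL divergence all coincide, and `savage` reproduces the point-wise risk as in (26).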

4.4 Bregman Information 

Banerjee et al. (2005a) recently introduced the notion of the Bregman information $\mathbb{B}_\phi(S)$
of a random variable $S$ drawn according to some distribution $\sigma$ over $\mathcal{S}$. It is the minimal
$\sigma$-average Bregman divergence that can be achieved by an element $s^* \in \mathcal{S}$ (the Bregman
representative). In symbols,

\[ \mathbb{B}_\phi(S) := \inf_{s \in \mathcal{S}} \mathbb{E}_{S \sim \sigma}[B_\phi(S, s)] = \mathbb{E}_{S \sim \sigma}[B_\phi(S, s^*)]. \]

23. Technically, $\mathcal{S}$ is the 2-simplex $\{(s_1, s_2) \in [0, 1]^2 : s_1 + s_2 = 1\}$ but we identify $s \in [0, 1]$ with $(s, 1 - s)$.

The authors show that the mean $\bar{s} := \mathbb{E}_{S \sim \sigma}[S]$ is the unique Bregman representative.
That is, $\mathbb{B}_\phi(S) = \mathbb{E}_\sigma[B_\phi(S, \bar{s})]$. Surprisingly, this minimiser only depends on $\mathcal{S}$ and $\sigma$,
not the choice of $\phi$ defining the divergence and is a consequence of Jensen's inequality
and the form of the Bregman divergence.

Since regret is a Bregman divergence, it is natural to ask what is the corresponding
Bregman information. In this case, $\phi = -\underline{L}$ and the random variable $S = \eta(X) \in
[0, 1]$ where $X \in \mathcal{X}$ is distributed according to the observation distribution $M$. Noting
that $\mathbb{E}_{X \sim M}[\eta(X)] = \pi$, the proof of the following theorem stems from the definition of
Bregman information and some simple algebra showing that $\mathbb{L}(\eta, \pi, M) = \underline{\mathbb{L}}(\pi, M)$, since
by assumption $\ell$ is a proper scoring rule.

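Theorem 8 Suppose $\ell$ is a proper scoring rule. Given a discriminative task $(\eta, M)$ and
letting $\phi = -\underline{L}$, the corresponding Bregman information of $\eta(X)$ satisfies

\[ \mathbb{B}_\phi(\eta(X)) = \underline{\mathbb{L}}(\pi, M) - \underline{\mathbb{L}}(\eta, M). \]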

4.4.1 Statistical Information 
The reduction of risk 

\[ \Delta\underline{\mathbb{L}}(\eta, M) = \Delta\underline{\mathbb{L}}(\pi, P, Q) := \underline{\mathbb{L}}(\pi, M) - \underline{\mathbb{L}}(\eta, M) \tag{27} \]

is known as statistical information and was introduced by DeGroot (1962). This
reduction can be interpreted as how much risk is removed by knowing observation-specific
class probabilities $\eta$ rather than just the average $\pi$.

DeGroot originally introduced statistical information in terms of what he called an
uncertainty function which, in the case of binary experiments, is any function $U : [0, 1] \to
[0, \infty)$. The statistical information is then the average reduction in uncertainty which
can be expressed as a concave Jensen gap

\[ -\mathbb{J}_M[U(\eta)] = \mathbb{J}_M[-U(\eta)] = U(\mathbb{E}_{X \sim M}[\eta(X)]) - \mathbb{E}_{X \sim M}[U(\eta(X))]. \]

DeGroot noted that Jensen's inequality implies that for this quantity to be non-negative
the uncertainty function must be concave, that is, $-U$ must be convex.

Theorem 8 shows that statistical information is a Bregman information and
corresponds to the Bregman divergence obtained by setting $\phi = -\underline{L}$. This connection readily
shows that $\Delta\underline{\mathbb{L}}(\eta, M) \ge 0$ (DeGroot, 1962, Thm 2.1) since the minimiser of the Bregman
information is $\pi = \mathbb{E}_{X \sim M}[\eta(X)]$ regardless of loss and $B_\phi(\eta, \pi) \ge 0$ since it is a regret.
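
A small numerical check of Theorem 8 and of the non-negativity of statistical information is sketched below. It assumes square loss, whose point-wise Bayes risk is $\eta(1 - \eta)$, together with an illustrative prior and pair of distributions; all names are hypothetical.

```python
def bayes_risk(eta):
    """Point-wise Bayes risk of square loss."""
    return eta * (1 - eta)

def bayes_risk_prime(eta):
    return 1 - 2 * eta

def bregman_neg_bayes(s, s0):
    """B_phi(s, s0) for phi = -bayes_risk."""
    return -bayes_risk(s) + bayes_risk(s0) + (s - s0) * bayes_risk_prime(s0)

# Illustrative task: prior pi and class-conditional distributions on four points.
pi = 0.3
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.4, 0.3, 0.2, 0.1]
M = [pi * p + (1 - pi) * q for p, q in zip(P, Q)]
eta = [pi * p / m for p, m in zip(P, M)]

# Statistical information (27): prior Bayes risk minus expected posterior Bayes risk.
stat_info = bayes_risk(pi) - sum(m * bayes_risk(e) for m, e in zip(M, eta))
# Bregman information of eta(X): average divergence to its mean pi.
breg_info = sum(m * bregman_neg_bayes(e, pi) for m, e in zip(M, eta))
print(stat_info, breg_info)
```

For square loss both quantities equal the variance of the posterior $\eta(X)$ under $M$, which makes the non-negativity evident.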

4.4.2 Unifying Information and Divergence 

From a generative perspective, f-divergences can be used to assess the difficulty of a
learning task by measuring the divergence between the class-conditional distributions $P$
and $Q$. The more divergent the distributions for the two classes, the easier the
classification task. Österreicher and Vajda (1993, Thm. 2) made this relationship precise by
showing that f-divergence and statistical information have a one-to-one correspondence:
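Theorem 9 If $(\pi, P, Q; \ell)$ is an arbitrary task and $\underline{L}$ is the associated conditional Bayes
risk then defining

\[ f^\pi(t) := \underline{L}(\pi) - (\pi t + 1 - \pi)\, \underline{L}\left( \frac{\pi t}{\pi t + 1 - \pi} \right) \tag{28} \]

for $\pi \in [0, 1]$ implies $f^\pi$ is convex and $f^\pi(1) = 0$ and

\[ \mathbb{I}_{f^\pi}(P, Q) = \Delta\underline{\mathbb{L}}(\pi, P, Q) \]

for all distributions $P$ and $Q$. Conversely, if $f$ is convex and $f(1) = 0$ and $\pi \in [0, 1]$
then defining

\[ \underline{L}^\pi(\eta) := -\frac{1 - \eta}{1 - \pi}\, f\left( \frac{1 - \pi}{\pi} \frac{\eta}{1 - \eta} \right) \]

implies

\[ \mathbb{I}_f(P, Q) = \Delta\underline{\mathbb{L}}^\pi(\pi, P, Q) \]

for all distributions $P$ and $Q$.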

The proof, given in Appendix A.3, is a straight-forward calculation that exploits the
relationships between the generative and discriminative views presented earlier.
Combined with the link between Bregman and statistical information, this result means that
they and f-divergences are interchangeable as measures of task difficulty. The theorem
leads to some correspondences between well known losses and divergence: log-loss with
$\mathrm{KL}(P, Q)$; square loss with triangular discrimination; and 0-1 loss with $V(P, Q)$. (See
Section 5.5 for an explicitly worked out example.)

This connection generalises the link between f-divergences and F-errors
(expectations of concave functions of $\eta$) in Devroye et al. (1996) and can be compared with the more
recent work of Nguyen et al. (2005) who show that each f-divergence corresponds to the
negative Bayes risk for a family of surrogate margin losses. The one-to-many nature of
their result may seem at odds with the one-to-one relationship here. However, the family
of margin losses given in their work can be recovered by combining the proper scoring
rules with link functions. Working with proper scoring rules also addresses a limitation
pointed out by Nguyen et al. (2005, pg. 14), namely that "asymmetric f-divergences
cannot be generated by any (margin-based) surrogate loss function" and extends their
analysis "to show that asymmetric f-divergences can be realized by general (asymmetric)
loss functions".
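
The correspondence between f-divergences and statistical information can be checked numerically in both directions. The sketch below assumes that the forward map takes the form $f^\pi(t) = \underline{L}(\pi) - (\pi t + 1 - \pi)\,\underline{L}(\pi t / (\pi t + 1 - \pi))$, which follows from substituting $\eta = \lambda_\pi^{-1}(dP/dQ)$ and $dM/dQ = \pi\, dP/dQ + 1 - \pi$ into (27), and that the converse map is $\underline{L}^\pi(\eta) = -\frac{1-\eta}{1-\pi} f(\lambda_\pi(\eta))$. The distributions, prior and helper names are illustrative assumptions.

```python
import math

def f_divergence(f, P, Q):
    """I_f(P, Q) = E_Q[f(dP/dQ)] on a finite space with Q > 0."""
    return sum(q * f(p / q) for p, q in zip(P, Q))

def statistical_information(bayes_risk, pi, P, Q):
    """Prior Bayes risk minus the M-average of the posterior Bayes risk, as in (27)."""
    M = [pi * p + (1 - pi) * q for p, q in zip(P, Q)]
    eta = [pi * p / m for p, m in zip(P, M)]
    return bayes_risk(pi) - sum(m * bayes_risk(e) for m, e in zip(M, eta))

pi = 0.3
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.4, 0.3, 0.2, 0.1]

# Forward direction: from the Bayes risk of square loss to a convex f.
square_bayes = lambda eta: eta * (1 - eta)

def f_pi(t):
    m = pi * t + 1 - pi                      # dM/dQ at likelihood ratio t
    return square_bayes(pi) - m * square_bayes(pi * t / m)

lhs = f_divergence(f_pi, P, Q)
rhs = statistical_information(square_bayes, pi, P, Q)

# Converse direction: from f(t) = t ln t (KL) to a Bayes risk function.
kl_f = lambda t: t * math.log(t)

def bayes_from_f(eta):
    lam = (1 - pi) / pi * eta / (1 - eta)    # lambda_pi(eta)
    return -(1 - eta) / (1 - pi) * kl_f(lam)

kl = f_divergence(kl_f, P, Q)
info_from_kl = statistical_information(bayes_from_f, pi, P, Q)
print(lhs, rhs, kl, info_from_kl)
```

In each direction the divergence and the statistical information agree up to rounding, for this particular choice of prior and distributions.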

4.5 Summary 

The main results of this section can be summarised as follows. 

Theorem 10 Let $f : [0, \infty) \to \mathbb{R}$ be a convex function and for each $\pi \in [0, 1]$ define for
$c \in [0, 1)$:

\[ \phi(c) := \frac{1 - c}{1 - \pi}\, f(\lambda_\pi(c)) \tag{29} \]

\[ \underline{L}(c) := -\phi(c) \tag{30} \]

where $\lambda_\pi$ is defined by (23). Then for every binary experiment $(P, Q)$ we have

\[ \mathbb{I}_f(P, Q) = \Delta\underline{\mathbb{L}}(\eta, M) = \mathbb{B}_\phi(\eta, M) \tag{31} \]

where $M := \pi P + (1 - \pi) Q$, $\eta := \pi\, dP/dM$ and $\underline{\mathbb{L}}$ is the expectation (in $X$) of the
conditional Bayes risk $\underline{L}$. Equivalently,

\[ \mathbb{J}_Q[f(dP/dQ)] = \mathbb{J}_M[-\underline{L}(\eta)] = \mathbb{J}_M[\phi(\eta)]. \tag{32} \]

What this says is that for each choice of $\pi$ the classes of f-divergences $\mathbb{I}_f$, statistical
informations $\Delta\underline{\mathbb{L}}$ and (discriminative) Bregman informations $\mathbb{B}_\phi$ can all be defined in
terms of the Jensen gap of some convex function. Additionally, there is a bijection
between each of these classes due to the mapping $\lambda_\pi$ that identifies likelihood ratios
with posterior probabilities.

It is important to note that the class of f-divergences is more "primitive" than the
other measures since its definition does not require the extra structure that is obtained
by assuming that the reference measure $M$ can be written as the convex combination of
the distributions $P$ and $Q$. Indeed, each $\mathbb{I}_f$ is invariant to choice of reference measure
and so is invariant to the choice of $\pi$. The results in the next section provide another
way of looking at this invariance of $\mathbb{I}_f$. In particular, we see that every f-divergence is
a weighted "average" of statistical informations or, equivalently, $\mathbb{I}_{f_\pi}$ divergences.

5. Primitives and Weighted Integral Representations

When given a class of functions like f-divergences, risks and measures of information it
is natural to ask what the "simplest" elements of these classes are. We would like to
know which functions are "primitive" in the sense that they can be used to express other
measures but themselves cannot be so expressed.

The main result of this section is that risks and f-divergences (and therefore also
statistical and Bregman information) can be expressed as weighted integrals of these
primitive elements. In the case of f-divergences and information the weight function in
these integrals completely determines their behaviour. This means the weight functions
can be used as a proxy for the analysis of these measures, or as a "knob" the user can
adjust in choosing what to measure.

We also show that the close relationships between information and f-divergence can
be directly translated into a relationship between the weight functions of these measures.
That is, given the weight function that determines an f-divergence there is, for each
choice of the prior $\pi$, a simple transformation that yields the weight function for the
corresponding statistical information, and vice versa.

5.1 Integral Representations of f-divergences

The following result shows that the class of f-divergences (and, by the result of the
previous section, statistical and Bregman information) is closed under linear combination.
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Theorem 11 For all convex functions fi,f 2 - (0, oo) — > M. and all a\,a 2 G [0,oo) ; i/ie 
/unction 

(0, oo) 9 tn» ^(t) := + a 2 f 2 {t) (33) 

is convex. Furthermore, for all distributions P and Q, we have 

I g (P,Q)=a 1 I fl (P,Q) + a 2 I f2 (P,Q). (34) 

Conversely, given f\, f 2 , ol\ and a 2 , if (34) holds for all P and Q then g must be, up to 
affine additions, of the form (33). 

The proof is a straight-forward application of the definition of convexity and of /- 
divergences. 

One immediate consequence of this result is that the set of /-divergences is closed 
under finite linear combinations J2i a $fi- Furthermore, the integral representations 
discussed in Section 2.3 extend this observation beyond finite linear combination to 
generalised weight functions a. By Corollary 2, if / is a convex function then expanding 
it about 1 in (9) and setting a(s) = f"(s) means that 

/•oo 

l f (P,Q)= / I F3 (P,Q)a(s)ds (35) 
J o 

where F s (t) = [a < l](a - t) + + [s > lj(t - s) + . 24 The set of functions {F s }~ 
can therefore be seen as the generators of the class of primitive /-divergences. As a 
function of t, each F s is piece-wise linear, with a single "hinge" at s. Of course, any 
affine translation of any F s is also a primitive. In fact, each F s may undergo a different 
affine translation without changing the /-divergence If. The weight function a is what 
completely characterises the behaviour of If. 

The integral in (35) need not always exist since the integrand may not be integrable. 
When the Cauchy Principal Value diverges we say the integral takes on the value oo. 
We note that many (not all) /-divergences can sometimes take on infinite values. 

The integral form in (35) can be readily transformed into an integral representation
that does not involve an infinite integrand. This is achieved by mapping the interval
$[0, \infty)$ onto $(0, 1]$ via the change of variables $\pi = \frac{1}{1+s} \in [0, 1]$. In this case, $s = \frac{1-\pi}{\pi}$ and
so $ds = -\frac{d\pi}{\pi^2}$ and the integral of (35) becomes

$$I_f(P,Q) = -\int_1^0 I_{F_{\frac{1-\pi}{\pi}}}(P,Q)\,\alpha\!\left(\frac{1-\pi}{\pi}\right)\pi^{-2}\,d\pi = \int_0^1 I_{\tilde f_\pi}(P,Q)\,\gamma(\pi)\,d\pi \qquad (36)$$

where

$$\tilde f_\pi(t) = \pi F_{\frac{1-\pi}{\pi}}(t) = \begin{cases} (1 - \pi(1+t))_+ , & \pi \ge \frac{1}{2} \\ (\pi(1+t) - 1)_+ , & \pi < \frac{1}{2} \end{cases} \qquad (37)$$

and

$$\gamma(\pi) := \frac{1}{\pi^3} f''\!\left(\frac{1-\pi}{\pi}\right). \qquad (38)$$

24. Technically, one must assume that $f$ is twice differentiable for this result to hold. However, the
convexity of $f$ implies it has well-defined one-sided derivatives $f'_+$ and $\alpha(s)$ can be expressed as the
measure corresponding to $df'_+/d\lambda$ for the Lebesgue measure $\lambda$. Details can be found in (Liese and
Vajda, 2006). The representation of a general f-divergence in terms of elementary ones is not new;
see for example Feldman and Österreicher (1989).

This observation forms the basis of the following theorem which will be used to discuss
the connection between f-divergences and statistical information.25

Theorem 12 Let $f$ be convex such that $f(1) = 0$. Then there exists a (generalised)
function $\gamma : (0, 1) \to \mathbb{R}$ such that, for all $P$ and $Q$:

$$I_f(P,Q) = \int_0^1 I_{f_\pi}(P,Q)\,\gamma(\pi)\,d\pi, \quad \text{where } f_\pi(t) = (1-\pi) \wedge \pi - (1-\pi) \wedge (\pi t).$$

Proof The earlier discussion giving the derivation of equation (36) implies the result.
The only discrepancy is over the form of $f_\pi$. However, this is remedied by noting that
the family of $\tilde f_\pi$ given in (37) can be transformed by affine addition without affecting
the representation of $I_f$. Specifically,

$$f_\pi(t) := (1-\pi) \wedge \pi - (1-\pi) \wedge (\pi t) = \begin{cases} (1 - \pi(1+t))_+ , & \pi \ge \frac{1}{2} \\ (\pi(1+t) - 1)_+ + \pi(1-t) , & \pi < \frac{1}{2} \end{cases} = \tilde f_\pi(t) + [\![\pi < \tfrac{1}{2}]\!]\,\pi(1-t)$$

and so $\tilde f_\pi$ and $f_\pi$ are in the same affine equivalence class for each $\pi \in [0, 1]$. Thus, by
Theorem 6 we have $I_{f_\pi} = I_{\tilde f_\pi}$ for each $\pi \in [0, 1]$, proving the result. ■

The specific choice of $f_\pi$ in the above theorem from all of the affine equivalents was
made to simplify the connection between integral representations for losses and
f-divergences, discussed in Section 5.4.
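The affine relationship used in the proof of Theorem 12 can be checked numerically. The following is a minimal sketch; the function names and the grids are illustrative choices, not part of the original development.

```python
def f_pi(pi, t):
    """Primitive from Theorem 12: (1 - pi) ^ pi - (1 - pi) ^ (pi * t)."""
    return min(1 - pi, pi) - min(1 - pi, pi * t)

def f_tilde(pi, t):
    """Primitive from (37): a hinge function whose form depends on whether pi >= 1/2."""
    if pi >= 0.5:
        return max(1 - pi * (1 + t), 0.0)
    return max(pi * (1 + t) - 1, 0.0)

def affine_gap(pi, t):
    """f_pi minus (f_tilde plus the affine term [pi < 1/2] * pi * (1 - t)); should vanish."""
    affine = pi * (1 - t) if pi < 0.5 else 0.0
    return f_pi(pi, t) - f_tilde(pi, t) - affine

# Illustrative grids over priors pi in (0, 1) and likelihood ratios t in [0, 10].
grid_pi = [i / 20 for i in range(1, 20)]
grid_t = [j / 4 for j in range(0, 41)]
max_gap = max(abs(affine_gap(p, t)) for p in grid_pi for t in grid_t)
```

On these grids `max_gap` should be zero up to floating point rounding, which is what places the two families in the same affine equivalence class.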

One can easily verify that the $f_\pi$ are convex hinge functions of $t$ with a hinge at $\frac{1-\pi}{\pi}$ and
$f_\pi(1) = 0$. Thus $\{I_{f_\pi}\}_{\pi \in (0,1)}$ is a family of primitive f-divergences; confer Österreicher
and Feldman (1981); Feldman and Österreicher (1989). This theorem implies an existing
representation of f-divergences due to Österreicher and Vajda (1993, Thm. 1) and
Gutenbrunner (1990). They show that an f-divergence can be represented as a weighted
integral of statistical informations for 0-1 loss: for all $P, Q$

$$I_f(P,Q) = \int_0^1 \Delta\underline{\mathbb{L}}^{0\text{-}1}(\pi, P, Q)\,\gamma(\pi)\,d\pi \qquad (39)$$

$$\gamma(\pi) = \frac{1}{\pi^3} f''\!\left(\frac{1-\pi}{\pi}\right). \qquad (40)$$

25. The $1/\pi^3$ term in the definition of $\gamma$ seems a little unusual at first glance. However, it is easily
understood as the product of two terms: $1/\pi^2$ from the second derivative of $(1-\pi)/\pi$, and $1/\pi$ from
a transformation of variables within the integral to map the limits of integration from $(0, \infty)$ to $(0, 1)$
via $\lambda_\pi$.

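An f-divergence is symmetric if $I_f(P, Q) = I_f(Q, P)$ for all $P, Q$. The representation
of $I_f$ in terms of $\gamma$ and Theorem 15 provides an easy test for symmetry:

Corollary 13 Suppose $I_f$ is an f-divergence with corresponding weight function $\gamma$ given
by (40). Then $I_f$ is symmetric iff $\gamma(\pi) = \gamma(1 - \pi)$ for all $\pi \in [0, 1]$.

Proof Let $f^\diamond(t) := t f(1/t)$ denote the Csiszár-dual of $f$ as described in Section 2.1
above. It is known (see (20) and e.g. Liese and Vajda (2006)) that
$I_f(P,Q) = I_{f^\diamond}(Q,P)$ for all $P, Q$, and hence that $I_f$ is symmetric if and only if

$$f(t) = f^\diamond(t) + c(t - 1)$$

for some $c \in \mathbb{R}$. Since $f$ and $\gamma$ are related by $f''\!\left(\frac{1-\pi}{\pi}\right) = \pi^3 \gamma(\pi)$ we can argue as
follows. Observe that $f^{\diamond\prime}(t) = f(1/t) - f'(1/t)/t$ and $f^{\diamond\prime\prime}(t) = f''(1/t)/t^3$. Hence
$f^{\diamond\prime\prime}\!\left(\frac{1-\pi}{\pi}\right) = f''\!\left(\frac{\pi}{1-\pi}\right)\left(\frac{\pi}{1-\pi}\right)^3$. Let $\pi' = 1 - \pi$. Thus $\frac{\pi}{1-\pi} = \frac{1-\pi'}{\pi'}$. Hence

$$f^{\diamond\prime\prime}\!\left(\frac{1-\pi}{\pi}\right) = f''\!\left(\frac{1-\pi'}{\pi'}\right)\left(\frac{\pi}{1-\pi}\right)^3 = (\pi')^3 \gamma(\pi') \left(\frac{\pi}{1-\pi}\right)^3 = \pi^3 \gamma(1-\pi).$$

Thus we have shown that $\pi \mapsto \gamma(1 - \pi)$ is the weight corresponding to $f^\diamond$, and so
$\gamma(1 - \pi) = \gamma(\pi)$ exactly when $f$ and $f^\diamond$ share a weight function.
Observing that $\frac{\partial^2}{\partial t^2}\left(f^\diamond(t) + c(t - 1)\right) = f^{\diamond\prime\prime}(t)$ concludes the proof. ■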
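The symmetry test of Corollary 13 is easy to try numerically. The sketch below assumes the standard generators $f(t) = t \ln t$ for KL$(P,Q)$ and $f(t) = (t-1)\ln t$ for the symmetrised divergence $J(P,Q)$; the helper names and the grid are hypothetical.

```python
import math

def gamma(f_second, pi):
    """Weight function from (40): gamma(pi) = f''((1 - pi) / pi) / pi**3."""
    return f_second((1 - pi) / pi) / pi ** 3

def kl_second(t):
    """Second derivative of f(t) = t ln t (generator assumed for KL(P, Q))."""
    return 1.0 / t

def jeffreys_second(t):
    """Second derivative of f(t) = (t - 1) ln t (generator assumed for J(P, Q))."""
    return 1.0 / t + 1.0 / t ** 2

def is_symmetric(f_second, grid, rel_tol=1e-9):
    """Check gamma(pi) == gamma(1 - pi) on a grid of priors."""
    return all(
        math.isclose(gamma(f_second, p), gamma(f_second, 1 - p), rel_tol=rel_tol)
        for p in grid
    )

grid = [i / 50 for i in range(1, 50)]
```

With these definitions `is_symmetric(jeffreys_second, grid)` should hold while `is_symmetric(kl_second, grid)` should not, in line with KL-divergence being asymmetric.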



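Corollary 13 provides a way of generating all convex $f$ such that $I_f$ is symmetric
that is simpler than that proposed by Hiriart-Urruty and Martínez-Legaz (2007): let
$\gamma(\pi) = \beta(\pi \wedge (1 - \pi))$ where $\beta \in (\mathbb{R}_+)^{[0,\frac{1}{2}]}$ and generate $f$ from $\gamma$ by inverting (40);
explicitly,

$$f(s) = \int_0^s \left( \int_0^t \frac{1}{(\tau+1)^3}\,\gamma\!\left(\frac{1}{\tau+1}\right) d\tau \right) dt, \quad s \in \mathbb{R}_+.$$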

5.2 Proper Scoring Rules and Cost-Weighted Risk

We now consider a representation of proper scoring rules in terms of primitive losses that
originates with Shuford et al. (1966). Our discussion follows that of Buja et al. (2005)
and then examines its implications in light of the connections between information and
divergence just presented.

The cost-weighted losses are a family of losses parametrised by a false positive cost
$c \in [0, 1]$ that defines a loss for $y \in \{\pm 1\}$ and $\hat\eta \in [0, 1]$ by

$$\ell_c(y, \hat\eta) = c\,[\![y = -1]\!][\![\hat\eta \ge c]\!] + (1 - c)\,[\![y = 1]\!][\![\hat\eta < c]\!]. \qquad (41)$$

Intuitively, a cost-weighted loss thresholds $\hat\eta$ at $c$ and assigns a cost if the resulting
classification disagrees with $y$. These correspond to the "signatures" for eliciting the
probability $\eta$ as described by Lambert et al. (2008). Substituting $c = \frac{1}{2}$ will verify that
$2\ell_{1/2}$ is equivalent to 0-1 misclassification loss $\ell^{0\text{-}1}$.

We will use $L_c$, $\mathbb{L}_c$ and $\Delta\underline{\mathbb{L}}_c$ to denote the cost-weighted point-wise risk, full risk and
statistical information associated with each cost-weighted loss. The following theorems
collect some useful observations about these primitive quantities. The first shows that the
point-wise Bayes risk is a simple, concave "tent" function. The second shows that
cost-weighted statistical information is invariant under the switching of the classes provided
the costs are also switched and that $\pi$ and $1 - c$ are interchangeable.

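Theorem 14 For all $\eta, c \in [0, 1]$ the point-wise Bayes risk $\underline{L}_c(\eta) = (1 - \eta)c \wedge (1 - c)\eta$
and is therefore concave in both $c$ and $\eta$.

Proof From the definition of $\ell_c$ in equation (41) and the definition of point-wise Bayes
risk, we have for $\eta \in [0, 1]$

$$\underline{L}_c(\eta) = \inf_{\hat\eta \in [0,1]} \mathbb{E}_{Y \sim \eta}[\ell_c(Y, \hat\eta)]$$
$$= \inf_{\hat\eta \in [0,1]} \left\{ (1 - \eta)c\,[\![\hat\eta \ge c]\!] + \eta(1 - c)\,[\![\hat\eta < c]\!] \right\}$$
$$= \inf_{\hat\eta \in [0,1]} \left\{ \eta(1 - c) + (c - \eta)\,[\![\hat\eta \ge c]\!] \right\}$$

where the last step makes use of the identity $[\![\hat\eta < c]\!] = 1 - [\![\hat\eta \ge c]\!]$. Since $(c - \eta)$ is
negative if and only if $\eta > c$ the infimum is obtained by having $[\![\hat\eta \ge c]\!] = 1$ if and only
if $\eta \ge c$, that is, by letting $\hat\eta = \eta$. In this case, when $\eta \ge c$ we have $\underline{L}_c(\eta) = c(1 - \eta)$ and
when $\eta < c$ we have $\underline{L}_c(\eta) = (1 - c)\eta$. The concavity of $\underline{L}_c$ is evident as this function is
the minimum of two linear functions of $c$ and $\eta$. ■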
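The "tent" form of the point-wise Bayes risk can be confirmed by brute force. The following sketch implements the cost-weighted loss (41) directly and minimises the conditional risk over a grid of estimates; the function names and grid sizes are illustrative assumptions.

```python
def cost_weighted_loss(c, y, eta_hat):
    """Cost-weighted loss (41) for a label y in {-1, +1} and an estimate eta_hat."""
    if y == -1:
        return c if eta_hat >= c else 0.0          # false positive costs c
    return (1 - c) if eta_hat < c else 0.0          # false negative costs 1 - c

def conditional_risk(c, eta, eta_hat):
    """Expected cost-weighted loss when the positive class has probability eta."""
    return ((1 - eta) * cost_weighted_loss(c, -1, eta_hat)
            + eta * cost_weighted_loss(c, +1, eta_hat))

def bayes_risk_brute_force(c, eta, n=200):
    """Minimise the conditional risk over the grid eta_hat = 0, 1/n, ..., 1."""
    return min(conditional_risk(c, eta, k / n) for k in range(n + 1))

def tent(c, eta):
    """Closed form from Theorem 14."""
    return min((1 - eta) * c, (1 - c) * eta)

worst_gap = max(
    abs(bayes_risk_brute_force(i / 10, j / 10) - tent(i / 10, j / 10))
    for i in range(11) for j in range(11)
)
```

The same loss function also illustrates the remark about 0-1 loss: `2 * cost_weighted_loss(0.5, y, eta_hat)` is 1 exactly when thresholding `eta_hat` at 1/2 disagrees with `y`.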



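Theorem 15 For all $c \in [0,1]$ and tasks $(\eta, M; \ell_c) = (\pi, P, Q; \ell_c)$ the statistical
information satisfies 1)

$$\Delta\underline{\mathbb{L}}_c(1 - \eta, M) = \Delta\underline{\mathbb{L}}_{1-c}(\eta, M),$$

or equivalently,

$$\Delta\underline{\mathbb{L}}_c(1 - \pi, Q, P) = \Delta\underline{\mathbb{L}}_{1-c}(\pi, P, Q);$$

and 2)

$$\Delta\underline{\mathbb{L}}_\pi(1 - c, P, Q) = \Delta\underline{\mathbb{L}}_c(1 - \pi, P, Q).$$

Proof By Theorem 14 we know $\underline{L}_c(\eta) = \min\{(1 - \eta)c, (1 - c)\eta\}$ and so $\underline{L}_c(1 - \eta) =
\underline{L}_{1-c}(\eta)$ for all $\eta, c \in [0, 1]$. Therefore, $\underline{\mathbb{L}}_c(1 - \eta, M) = \underline{\mathbb{L}}_{1-c}(\eta, M)$ for any $\eta : X \to [0, 1]$
including the constant function $\mathbb{E}_M[\eta]$. By definition, $\Delta\underline{\mathbb{L}}_c(\eta, M) = \underline{\mathbb{L}}_c(\mathbb{E}_M[\eta], M) -
\underline{\mathbb{L}}_c(\eta, M)$ and so $\Delta\underline{\mathbb{L}}_{1-c}(\eta, M) = \Delta\underline{\mathbb{L}}_c(1 - \eta, M)$ proving part 1 of the theorem.

Part 2 also follows from Theorem 14 by noting that $\underline{L}_c(1 - \pi) = \underline{L}_\pi(1 - c)$ and that
$\mathbb{E}_M[\underline{L}_c(\eta)] = \int_X \min\{(1 - c)\pi\,dP, (1 - \pi)c\,dQ\}$. ■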



5.3 Integral Representations of Proper Scoring Rules 

The cost-weighted losses are primitive in the sense that they form the basis for a Cho- 
quet integral representation of proper scoring rules. This representation is essentially a 
consequence of Taylor's theorem and was originally studied by Shuford et al. (1966) and 
later generalised by Schervish (1989). The recent presentation of this result by Lam- 
bert et al. (2008) gives yet a more general formulation in terms of the elicitability of 
properties of distributions, along with a geometric derivation. An historical summary of 
decompositions of scoring rules is given by Winkler et al. (1990, Section 4). 

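Theorem 16 A function $\ell : \mathcal{Y} \times [0, 1] \to \mathbb{R}$ is a proper scoring rule that is not everywhere
infinite and satisfies

$$\ell(y, y) = \lim_{\hat\eta \to y} \ell(y, \hat\eta) \qquad (42)$$

for $y \in \{0, 1\}$ iff for each $\hat\eta \in [0, 1]$ and $y \in \mathcal{Y}$

$$\ell(y, \hat\eta) = \int_0^1 \ell_c(y, \hat\eta)\,w(c)\,dc \qquad (43)$$

where

$$w(c) = -\underline{L}''(c) \qquad (44)$$

and $\underline{L}$ is the conditional Bayes risk for $\ell$.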

The conditions on the scoring rule are required to avoid meaningless losses that assign
infinite costs regardless of the estimate, and those rules which jump to infinity at the
endpoints of $[0, 1]$. As is the case throughout this paper, the second derivative of $\underline{L}$ is
to be interpreted distributionally. That is, $\underline{L}''$ may be a generalised function such as the
Dirac $\delta$. The proof is technical and has been presented by Schervish (1989) and Lambert
et al. (2008).

This is a powerful result that effectively identifies all the Fisher consistent losses $\ell$ for
probability estimation (and hence most surrogate margin losses) with a weight function
$w$. This shift from "losses as functions from estimates to costs" to "losses as sums of
primitive losses" is (loosely!) analogous to the way the Fourier transform represents
functions as sums of simple, periodic signals.

We will write $\ell_w$, $L_w$ and $\mathbb{L}_w$ to explicitly indicate the parametrisation of the loss,
conditional loss and expected loss by the weight function $w$. We will also make use of
the expression for $B_c$ derived by Buja et al. (2005):






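Lemma 17 For any cost $c \in [0, 1]$ the cost-weighted regret $B_c(\eta, \hat\eta) := L_c(\eta, \hat\eta) - \underline{L}_c(\eta)$
can be expressed as

$$B_c(\eta, \hat\eta) = |\eta - c|\,[\![\eta \wedge \hat\eta < c \le \eta \vee \hat\eta]\!]. \qquad (45)$$

Proof From Theorem 14 we know that $\underline{L}_c(\eta) = \min\{(1 - \eta)c, (1 - c)\eta\}$ and note that
$(1 - \eta)c \le (1 - c)\eta \iff c \le \eta$. Then, by the definition of $L_c$ and the identity
$1 - [\![p]\!] = [\![\neg p]\!]$ we have

$$B_c(\eta, \hat\eta) = (1 - \eta)c\,[\![\hat\eta \ge c]\!] + (1 - c)\eta\,[\![\hat\eta < c]\!] - \min\{(1 - \eta)c, (1 - c)\eta\}$$
$$= (1 - \eta)c\,[\![\hat\eta \ge c]\!] + (1 - c)\eta\,[\![\hat\eta < c]\!] - (1 - \eta)c\,[\![\eta \ge c]\!] - (1 - c)\eta\,[\![\eta < c]\!]$$
$$= (1 - \eta)c\left([\![\hat\eta \ge c]\!] - [\![\eta \ge c]\!]\right) + (1 - c)\eta\left([\![\hat\eta < c]\!] - [\![\eta < c]\!]\right).$$

Note that $[\![\hat\eta \ge c]\!] - [\![\eta \ge c]\!]$ is either 1 or -1 depending on whether $\hat\eta \ge c > \eta$ or $\hat\eta < c \le \eta$
and is zero otherwise. Similarly, $[\![\hat\eta < c]\!] - [\![\eta < c]\!]$ is 1 when $\hat\eta < c \le \eta$, is -1 when
$\hat\eta \ge c > \eta$ and is zero otherwise. This means

$$B_c(\eta, \hat\eta) = \begin{cases} (1 - \eta)c - (1 - c)\eta, & \hat\eta \ge c > \eta \\ -(1 - \eta)c + (1 - c)\eta, & \eta \ge c > \hat\eta \end{cases}$$
$$= \begin{cases} c - \eta, & \hat\eta \ge c > \eta \\ \eta - c, & \eta \ge c > \hat\eta \end{cases}$$
$$= |\eta - c|\,[\![\min\{\eta, \hat\eta\} < c \le \max\{\eta, \hat\eta\}]\!]$$

as required. ■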
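Lemma 17 can be sanity-checked by comparing the regret computed from its definition with the closed form (45). This is a small sketch with hypothetical function names; the thresholding convention ($\hat\eta \ge c$ predicts the positive class) is the one assumed in the reconstruction of (41).

```python
def conditional_risk(c, eta, eta_hat):
    """Point-wise cost-weighted risk L_c(eta, eta_hat) obtained from (41)."""
    return (1 - eta) * c * (eta_hat >= c) + eta * (1 - c) * (eta_hat < c)

def regret_direct(c, eta, eta_hat):
    """Regret from its definition: conditional risk minus the tent-shaped Bayes risk."""
    bayes = min((1 - eta) * c, (1 - c) * eta)
    return conditional_risk(c, eta, eta_hat) - bayes

def regret_formula(c, eta, eta_hat):
    """Closed form (45): |eta - c| when c separates eta and eta_hat, else zero."""
    lo, hi = min(eta, eta_hat), max(eta, eta_hat)
    return abs(eta - c) if lo < c <= hi else 0.0

points = [k / 20 for k in range(21)]
worst_gap = max(
    abs(regret_direct(c, e, eh) - regret_formula(c, e, eh))
    for c in points for e in points for eh in points
)
```

The regret is non-zero only when the cost threshold $c$ falls between the true and estimated probabilities, which is why integrating against a weight $w$ later reduces to an integral between $\eta$ and $\hat\eta$.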



Theorem 18 Suppose $w : [0, 1] \to \mathbb{R}_+$ is a weight function and let

$$W(t) := \int^t w(x)\,dx \qquad (46)$$
$$\overline{W}(t) := \int^t W(x)\,dx. \qquad (47)$$

Then the regret of $\hat\eta$ with respect to a true $\eta$ under the proper scoring rule induced by $w$
satisfies

$$B_w(\eta, \hat\eta) = \overline{W}(\eta) - \overline{W}(\hat\eta) - (\eta - \hat\eta)W(\hat\eta). \qquad (48)$$

One can easily check that the arbitrary constants of integration in (46) and (47) cancel
out in (48) and thus do not matter.
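Proof From (43) and (45) we have

$$B_w(\eta, \hat\eta) = \int_{\eta \wedge \hat\eta}^{\eta \vee \hat\eta} |\eta - c|\,w(c)\,dc = \int_{\eta \wedge \hat\eta}^{\eta} (\eta - c)w(c)\,dc + \int_{\eta}^{\eta \vee \hat\eta} (c - \eta)w(c)\,dc. \qquad (49)$$

Now using integration by parts we have

$$\int (c - \eta)w(c)\,dc = (c - \eta)W(c) - \int W(c)\,dc = (c - \eta)W(c) - \overline{W}(c).$$

Similarly

$$\int (\eta - c)w(c)\,dc = -(c - \eta)W(c) + \overline{W}(c).$$

Thus from (49) we have

$$B_w(\eta, \hat\eta) = \left[(c - \eta)W(c) - \overline{W}(c)\right]_{\eta}^{\eta \vee \hat\eta} - \left[(c - \eta)W(c) - \overline{W}(c)\right]_{\eta \wedge \hat\eta}^{\eta}$$
$$= 2\overline{W}(\eta) - \overline{W}(\eta \wedge \hat\eta) - \overline{W}(\eta \vee \hat\eta) - (\eta - \eta \wedge \hat\eta)W(\eta \wedge \hat\eta) - (\eta - \eta \vee \hat\eta)W(\eta \vee \hat\eta).$$

If $\hat\eta < \eta$, then $\eta \wedge \hat\eta = \hat\eta$ and $\eta \vee \hat\eta = \eta$ and we obtain

$$B_w(\eta, \hat\eta) = 2\overline{W}(\eta) - \overline{W}(\hat\eta) - \overline{W}(\eta) - (\eta - \hat\eta)W(\hat\eta) - (\eta - \eta)W(\eta)$$
$$= \overline{W}(\eta) - \overline{W}(\hat\eta) - (\eta - \hat\eta)W(\hat\eta).$$

If instead $\hat\eta > \eta$, then $\eta \wedge \hat\eta = \eta$ and $\eta \vee \hat\eta = \hat\eta$ and we have

$$B_w(\eta, \hat\eta) = 2\overline{W}(\eta) - \overline{W}(\eta) - \overline{W}(\hat\eta) - (\eta - \hat\eta)W(\hat\eta)$$
$$= \overline{W}(\eta) - \overline{W}(\hat\eta) - (\eta - \hat\eta)W(\hat\eta).$$

Thus in either case we obtain (48). ■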
Using (48) we can take a Taylor series expansion of $B_w(\eta, \hat\eta)$ in $\hat\eta$ about $\eta$ to obtain

$$B_w(\eta, \hat\eta) = \tfrac{1}{2} w(\eta)(\hat\eta - \eta)^2 + \tfrac{1}{3} w'(\eta)(\hat\eta - \eta)^3 + \tfrac{1}{8} w''(\eta)(\hat\eta - \eta)^4 + \cdots$$

This matches the second order result presented by Buja et al. (2005).
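Both the closed form (48) and the Taylor expansion can be checked numerically. The sketch below uses the weight $w(c) = 1/(c(1-c))$ (the log-loss weight, taken here as an illustrative choice), a simple midpoint rule for the integral in (49), and Taylor coefficients 1/2, 1/3 and 1/8 obtained by expanding (48) term by term; all function names are hypothetical.

```python
import math

def w(c):
    """Illustrative weight function: w(c) = 1 / (c (1 - c))."""
    return 1.0 / (c * (1.0 - c))

def W(c):
    """An antiderivative of w."""
    return math.log(c / (1.0 - c))

def W_bar(c):
    """An antiderivative of W."""
    return (1.0 - c) * math.log(1.0 - c) + c * math.log(c)

def regret_closed_form(eta, eta_hat):
    """Equation (48)."""
    return W_bar(eta) - W_bar(eta_hat) - (eta - eta_hat) * W(eta_hat)

def regret_by_quadrature(eta, eta_hat, n=20000):
    """Midpoint rule for the integral of |eta - c| w(c) between eta and eta_hat."""
    lo, hi = min(eta, eta_hat), max(eta, eta_hat)
    h = (hi - lo) / n
    mids = (lo + (k + 0.5) * h for k in range(n))
    return sum(abs(eta - c) * w(c) for c in mids) * h

def regret_taylor(eta, eta_hat):
    """Expansion up to fourth order in (eta_hat - eta)."""
    d = eta_hat - eta
    w1 = -1.0 / eta ** 2 + 1.0 / (1.0 - eta) ** 2      # w'(eta)
    w2 = 2.0 / eta ** 3 + 2.0 / (1.0 - eta) ** 3       # w''(eta)
    return w(eta) * d ** 2 / 2 + w1 * d ** 3 / 3 + w2 * d ** 4 / 8
```

For this weight the closed form is also the Kullback-Leibler divergence between Bernoulli distributions with parameters $\eta$ and $\hat\eta$, which gives a third, independent check.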

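We consider three examples. First, consider $w(c) = 1$ for $c \in (0, 1)$. Thus $W(c) = c$
and $\overline{W}(c) = c^2/2$ and thus

$$B_w(\eta, \hat\eta) = \frac{\eta^2}{2} - \frac{\hat\eta^2}{2} - (\eta - \hat\eta)\hat\eta = \frac{(\eta - \hat\eta)^2}{2}$$

which is also apparent from the above Taylor series. Second, consider $w(c) = \frac{1}{c(1-c)}$. We
have $W(c) = \ln\!\left(\frac{c}{1-c}\right)$ and $\overline{W}(c) = (1 - c)\ln(1 - c) + c\ln(c)$ and thus

$$B_w(\eta, \hat\eta) = (1 - \eta)\ln(1 - \eta) + \eta\ln\eta - (1 - \hat\eta)\ln(1 - \hat\eta) - \hat\eta\ln\hat\eta - (\eta - \hat\eta)\ln\!\left(\frac{\hat\eta}{1 - \hat\eta}\right)$$
$$= (1 - \eta)\ln\!\left(\frac{1 - \eta}{1 - \hat\eta}\right) + \eta\ln\!\left(\frac{\eta}{\hat\eta}\right)$$

which agrees with the expression given by Buja et al. (2005). Finally consider $w(c) =
\delta(c - c_0)$, $c_0 \in (0, 1)$. We have $W(c) = U(c - c_0)$ and $\overline{W}(c) = (c - c_0)_+$ and thus
substituting into (48) we obtain

$$B_{c_0}(\eta, \hat\eta) = (\eta - c_0)_+ - (\hat\eta - c_0)_+ - (\eta - \hat\eta)U(\hat\eta - c_0),$$

which agrees with (45). This can be written as

$$B_{c_0}(\eta, \hat\eta) = \begin{cases} c_0 - \eta & \text{if } \eta < c_0 \text{ and } \hat\eta \ge c_0 \\ \eta - c_0 & \text{if } \eta \ge c_0 \text{ and } \hat\eta < c_0 \\ 0 & \text{otherwise.} \end{cases} \qquad (50)$$

In the special case that $c_0 = \frac{1}{2}$ we have

$$B_{1/2}(\eta, \hat\eta) = \begin{cases} |\eta - \tfrac{1}{2}| & \text{if } \eta < \tfrac{1}{2} \le \hat\eta \text{ or } \hat\eta < \tfrac{1}{2} \le \eta \\ 0 & \text{otherwise.} \end{cases} \qquad (51)$$

A similar approach allows a direct calculation of a general form for the $w$-weighted
conditional loss $L_w(\eta, \hat\eta)$.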

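Theorem 19 Let $w$, $W$ and $\overline{W}$ be as in Theorem 18. Then for all $\eta, \hat\eta \in [0, 1]$,

$$L_w(\eta, \hat\eta) = -\overline{W}(\hat\eta) + W(\hat\eta)(\hat\eta - \eta) + \eta(\overline{W}(1) + \overline{W}(0)) - \overline{W}(0). \qquad (52)$$

Furthermore the conditional Bayes risk satisfies

$$\underline{L}_w(\eta) = L_w(\eta, \eta) = -\overline{W}(\eta) + \eta(\overline{W}(1) + \overline{W}(0)) - \overline{W}(0). \qquad (53)$$

Proof Starting from the expression given by Buja et al. (2005, Equation 17) and again
integrating by parts we have

$$L_w(\eta, \hat\eta) = \int_0^1 \left[\eta(1 - t)[\![t > \hat\eta]\!] + (1 - \eta)t\,[\![t \le \hat\eta]\!]\right] w(t)\,dt$$
$$= \eta\int_{\hat\eta}^1 (1 - t)w(t)\,dt + (1 - \eta)\int_0^{\hat\eta} t\,w(t)\,dt$$
$$= \eta\left[(1 - t)W(t) + \overline{W}(t)\right]_{\hat\eta}^{1} + (1 - \eta)\left[tW(t) - \overline{W}(t)\right]_{0}^{\hat\eta}$$
$$= W(\hat\eta)\left[(1 - \eta)\hat\eta - \eta(1 - \hat\eta)\right] - \overline{W}(\hat\eta) + \eta\overline{W}(1) - (1 - \eta)\overline{W}(0)$$
$$= -\overline{W}(\hat\eta) + W(\hat\eta)(\hat\eta - \eta) + \eta(\overline{W}(1) + \overline{W}(0)) - \overline{W}(0). \qquad (54)$$

Since $w$ is everywhere non-negative, $W$ and $\overline{W}$ are too (we deal with the constants of
integration shortly; see e.g. (82)). Consequently (54) is minimised by setting $\hat\eta = \eta$ in
which case we obtain (53). ■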






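The above theorem leads to a simple direct proof of Theorem 7.
Proof (Theorem 7) The concavity of $\underline{L}_w(\eta)$ follows immediately from (53) since $\overline{W}$ is
convex, being the integral of a monotonically increasing function $W$, the integral of a
non-negative function $w$. From (53) we have

$$\underline{L}_w'(\hat\eta) = -W(\hat\eta) + \overline{W}(1) + \overline{W}(0).$$

Thus

$$\underline{L}_w(\hat\eta) - (\hat\eta - \eta)\underline{L}_w'(\hat\eta) = -\overline{W}(\hat\eta) + \hat\eta(\overline{W}(1) + \overline{W}(0)) - \overline{W}(0) - (\hat\eta - \eta)\left[-W(\hat\eta) + \overline{W}(1) + \overline{W}(0)\right]$$
$$= -\overline{W}(\hat\eta) + W(\hat\eta)(\hat\eta - \eta) + \eta(\overline{W}(1) + \overline{W}(0)) - \overline{W}(0)$$
$$= L_w(\eta, \hat\eta)$$

where the last step follows from (52). This proves (26). ■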



5.3.1 Convexity, Matching Losses and Canonical Links 

Recall from Section 2.2 that the Legendre-Fenchel dual of $f$ can be expressed in terms
of its derivative and inverse. Furthermore in this case (writing $Df := f'$) $f' = (Df^*)^{-1}$.
Thus with $w$, $W$, and $\overline{W}$ defined as above,

$$W = (D(\overline{W}^*))^{-1}, \quad W^{-1} = D(\overline{W}^*), \quad \overline{W}^* = \int W^{-1}. \qquad (55)$$

We now further consider $B_w$ as given by (48). It will be convenient to parametrise $B$ by
$W$ instead of $w$. Note that the standard parametrisation for a Bregman divergence is in
terms of the convex function $\overline{W}$. Thus we will write $B_{\overline{W}}$, $B_W$ and $B_w$ to all represent (48).
The following theorem is well known (e.g. Zhang (2004a)) but, as will be seen, stating it
in terms of $B_W$ provides some advantages.

Theorem 20 Let $w$, $W$, $\overline{W}$ and $B_W$ be as above. Then for all $x, y \in [0, 1]$,

$$B_W(x, y) = B_{W^{-1}}(W(y), W(x)). \qquad (56)$$

Proof Using (7) we have

$$\overline{W}^*(u) = u \cdot W^{-1}(u) - \overline{W}(W^{-1}(u))$$
$$\Rightarrow \overline{W}(W^{-1}(u)) = u \cdot W^{-1}(u) - \overline{W}^*(u). \qquad (57)$$

Equivalently (using (55))

$$\overline{W}^*(W(u)) = u \cdot W(u) - \overline{W}(u). \qquad (58)$$

Thus substituting and then using (57) we have

$$B_W(x, W^{-1}(v)) = \overline{W}(x) - \overline{W}(W^{-1}(v)) - (x - W^{-1}(v)) \cdot W(W^{-1}(v))$$
$$= \overline{W}(x) + \overline{W}^*(v) - vW^{-1}(v) - (x - W^{-1}(v)) \cdot v$$
$$= \overline{W}(x) + \overline{W}^*(v) - x \cdot v. \qquad (59)$$

Similarly (this time using (58)) we have

$$B_{W^{-1}}(v, W(x)) = \overline{W}^*(v) - \overline{W}^*(W(x)) - (v - W(x)) \cdot W^{-1}(W(x))$$
$$= \overline{W}^*(v) - xW(x) + \overline{W}(x) - v \cdot x + xW(x)$$
$$= \overline{W}^*(v) + \overline{W}(x) - v \cdot x. \qquad (60)$$

Comparing (59) and (60) we see that

$$B_W(x, W^{-1}(v)) = B_{W^{-1}}(v, W(x)).$$

Let $y = W^{-1}(v)$. Thus substituting $v = W(y)$ leads to (56). ■

The weight function corresponding to $B_{W^{-1}}$ is $\frac{\partial}{\partial x} W^{-1}(x) = \frac{1}{w(W^{-1}(x))}$.

Often in estimating $\eta$ one uses a parametric representation of $\hat\eta : X \to [0,1]$ which has
a natural scale not matching $[0, 1]$. In such cases it is common to use a link function
(McCullagh and Nelder, 1989; Kivinen and Warmuth, 2001; Helmbold et al., 1999).
Traditionally one writes $\hat\eta = \psi^{-1}(\hat h)$ where $\psi^{-1}$ is the "inverse link" (and $\psi$ is of course the
forward link). The function $\hat h : X \to \mathbb{R}$ is the hypothesis. Often $\hat h = \hat h_\alpha$ is parametrised
linearly in a parameter vector $\alpha$. In such a situation it is computationally convenient
if $L_W(\eta, \psi^{-1}(\hat h))$ is convex in $\hat h$ (which implies it is convex in $\alpha$ when $\hat h$ is linear in
$\alpha$). The following result provides a simple sufficient condition for the "composite loss"
$L_W(\eta, \psi^{-1}(\hat h))$ to be convex in $\hat h$. It was previously shown (with a more intricate proof)
by Buja et al. (2005). The result also corresponds to the notion of "matching loss" as
developed by Helmbold et al. (1999) and Kivinen and Warmuth (2001).

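Theorem 21 Let $w$, $W$, $\overline{W}$ and $B_W$ be as above. Denote by $L_W$ the $w$-weighted
conditional loss parametrised by $W = \int w$. If the inverse link $\psi^{-1} = W^{-1}$ (and thus
$\hat\eta = W^{-1}(\hat h)$) then

$$B_W(\eta, \hat\eta) = B_W(\eta, W^{-1}(\hat h)) = \overline{W}(\eta) + \overline{W}^*(\hat h) - \eta \cdot \hat h$$
$$L_W(\eta, \hat\eta) = L_W(\eta, W^{-1}(\hat h)) = \overline{W}^*(\hat h) - \eta \cdot \hat h + \eta(\overline{W}(1) + \overline{W}(0)) - \overline{W}(0)$$
$$\frac{\partial}{\partial \hat h} L_W(\eta, W^{-1}(\hat h)) = \hat\eta - \eta$$

and furthermore $B_W(\eta, W^{-1}(\hat h))$ and $L_W(\eta, W^{-1}(\hat h))$ are convex in $\hat h$.

Proof The first two expressions follow immediately from (59) and (60) by substitution.
The derivative follows from calculation: $\frac{\partial}{\partial \hat h} L_W(\eta, W^{-1}(\hat h)) = \frac{\partial}{\partial \hat h}\left(\overline{W}^*(\hat h) - \eta \cdot \hat h\right) =
W^{-1}(\hat h) - \eta = \hat\eta - \eta$. The convexity follows from the fact that $\overline{W}^*$ is convex (since it is
the LF dual of a convex function $\overline{W}$) and the overall expression is the sum of this and
a linear term, and thus convex. ■

Buja et al. (2005) call $W$ the canonical link.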
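As an illustration of the canonical link, take the weight $w(c) = 1/(c(1-c))$ used earlier. Then $W$ is the logit function, $W^{-1}$ is the logistic sigmoid and $\overline{W}^*(h) = \ln(1 + e^h)$. The sketch below, whose names are illustrative rather than taken from the text, checks the first identity of Theorem 21 and the derivative $\hat\eta - \eta$ by finite differences.

```python
import math

def sigmoid(h):
    """Inverse canonical link W^{-1} for the weight w(c) = 1 / (c (1 - c))."""
    return 1.0 / (1.0 + math.exp(-h))

def W_bar(c):
    """Convex function whose second derivative is w: c ln c + (1 - c) ln(1 - c)."""
    return c * math.log(c) + (1.0 - c) * math.log(1.0 - c)

def W_bar_star(h):
    """Legendre-Fenchel dual of W_bar, which is ln(1 + e^h) for this weight."""
    return math.log1p(math.exp(h))

def regret_in_h(eta, h):
    """B_W(eta, W^{-1}(h)) = W_bar(eta) + W_bar*(h) - eta * h, as in Theorem 21."""
    return W_bar(eta) + W_bar_star(h) - eta * h

def bernoulli_kl(eta, eta_hat):
    """Regret of log loss written directly in terms of eta_hat."""
    return (eta * math.log(eta / eta_hat)
            + (1 - eta) * math.log((1 - eta) / (1 - eta_hat)))

def derivative_in_h(eta, h, eps=1e-5):
    """Central finite difference of the composite regret with respect to h."""
    return (regret_in_h(eta, h + eps) - regret_in_h(eta, h - eps)) / (2 * eps)
```

Because the composite regret is a convex function of `h` plus a linear term, a gradient method on the hypothesis scale has no spurious local minima for this loss, which is the computational convenience the text refers to.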

Importantly, the linearity of expectation means that the same weight function can
be used to write a loss's risk and statistical information as a weighted integral of the
primitives $\mathbb{L}_c$ and $\Delta\underline{\mathbb{L}}_c$, respectively. When combined with Theorem 9, these results give
a similar weighted integral representation for Bregman divergences.



5.4 Relating Integral Representations for $\underline{\mathbb{L}}$ and $I_f$

We can also give a translation between the weight functions $\gamma$ for an f-divergence and
$w$ for the corresponding statistical information.

Theorem 22 Let $f$ be convex (with $f(1) = 0$) and define $I_f$ and the weight function $\gamma$.
Then for each $\pi \in (0, 1)$ the weight function $w_\pi$ in Theorem 16 for the loss given by
Theorem 9 satisfies

$$w_\pi(c) = \frac{\pi(1-\pi)}{\nu(\pi, c)^3}\,\gamma\!\left(\frac{(1-c)\pi}{\nu(\pi, c)}\right)$$

or, inversely,

$$\gamma(c) = \frac{\pi^2(1-\pi)^2}{\nu(\pi, c)^3}\,w_\pi\!\left(\frac{\pi(1-c)}{\nu(\pi, c)}\right)$$

where $\nu(\pi, c) = (1 - c)\pi + (1 - \pi)c$.

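Proof Theorem 9 shows that

$$\underline{L}^\pi(\eta) = -\frac{1-\eta}{1-\pi}\,f\!\left(\frac{1-\pi}{\pi}\,\frac{\eta}{1-\eta}\right) \qquad (61)$$

and we have seen from (44) that $w_\pi(c) = -(\underline{L}^\pi)''(c)$. The remainder of this proof involves
taking the second derivative of $\underline{L}^\pi$, doing some messy algebra and matching the result to
the relationship between $\gamma$ and $f''$ in equation (40).

Letting $\tau_\pi(\eta) = \frac{1-\pi}{\pi}\,\frac{\eta}{1-\eta}$ and taking derivatives of (61) yields

$$-(\underline{L}^\pi)'(\eta) = (1 - \pi)^{-1}\left[-f(\tau_\pi) + (1 - \eta) f'(\tau_\pi)\tau_\pi'\right]$$
$$-(\underline{L}^\pi)''(\eta) = (1 - \pi)^{-1}\left[-f'(\tau_\pi)\tau_\pi' + (1 - \eta)\left(f'(\tau_\pi)\tau_\pi'' + f''(\tau_\pi)(\tau_\pi')^2\right) - f'(\tau_\pi)\tau_\pi'\right]$$
$$= (1 - \pi)^{-1}\left[\left(-2\tau_\pi' + (1 - \eta)\tau_\pi''\right) f'(\tau_\pi) + (1 - \eta)(\tau_\pi')^2 f''(\tau_\pi)\right].$$

However, the form of $\tau_\pi$ means $\tau_\pi' = \frac{1-\pi}{\pi}\,\frac{1}{(1-\eta)^2}$ and so $\tau_\pi'' = \frac{1-\pi}{\pi}\,\frac{2}{(1-\eta)^3}$. This means the
coefficient of $f'(\tau_\pi)$ in the above expression vanishes:

$$-2\tau_\pi' + (1 - \eta)\tau_\pi'' = \frac{1-\pi}{\pi}\left[\frac{-2}{(1-\eta)^2} + (1 - \eta)\frac{2}{(1-\eta)^3}\right] = 0.$$

Substituting this back into $-(\underline{L}^\pi)''$ gives us

$$w_\pi(\eta) = -(\underline{L}^\pi)''(\eta) = \frac{1-\eta}{1-\pi}\,f''(\tau_\pi)(\tau_\pi')^2$$
$$= \frac{1-\eta}{1-\pi}\,f''\!\left(\frac{1-\pi}{\pi}\,\frac{\eta}{1-\eta}\right)\frac{(1-\pi)^2}{\pi^2}\,\frac{1}{(1-\eta)^4}$$
$$= \frac{1-\pi}{\pi^2(1-\eta)^3}\,f''\!\left(\frac{1-\pi}{\pi}\,\frac{\eta}{1-\eta}\right).$$

By equation (40) we have

$$\gamma(t) = \frac{1}{t^3}\,f''\!\left(\frac{1-t}{t}\right). \qquad (62)$$

Letting $t = \frac{(1-c)\pi}{(1-c)\pi + (1-\pi)c}$ in that expression gives

$$\gamma\!\left(\frac{(1-c)\pi}{\nu(\pi, c)}\right) = \frac{\nu(\pi, c)^3}{(1-c)^3\pi^3}\,f''\!\left(\frac{1-\pi}{\pi}\,\frac{c}{1-c}\right).$$

Thus

$$\frac{\pi(1-\pi)}{\nu(\pi, c)^3}\,\gamma\!\left(\frac{(1-c)\pi}{\nu(\pi, c)}\right) = \frac{1-\pi}{\pi^2(1-c)^3}\,f''\!\left(\frac{1-\pi}{\pi}\,\frac{c}{1-c}\right) = w_\pi(c)$$

as required. The argument to show the inverse relationship is essentially the same. ■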
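The translation in Theorem 22 can be checked numerically for a concrete divergence. The sketch below assumes the generator $f(t) = t \ln t$ of KL$(P,Q)$, whose weight from (40) is $\gamma(\pi) = 1/(\pi^2(1-\pi))$, and compares the weight predicted by the theorem with $-(\underline{L}^\pi)''$ obtained from (61) by central finite differences. The helper names and step size are illustrative choices.

```python
import math

def f(t):
    """Generator assumed for KL(P, Q): f(t) = t ln t."""
    return t * math.log(t)

def gamma(p):
    """Weight of KL(P, Q) from (40): 1 / (p^2 (1 - p))."""
    return 1.0 / (p * p * (1.0 - p))

def bayes_risk(pi, eta):
    """Conditional Bayes risk from equation (61)."""
    return -(1 - eta) / (1 - pi) * f((1 - pi) / pi * eta / (1 - eta))

def nu(pi, c):
    return (1 - c) * pi + (1 - pi) * c

def w_from_gamma(pi, c):
    """Weight function predicted by Theorem 22."""
    return pi * (1 - pi) / nu(pi, c) ** 3 * gamma((1 - c) * pi / nu(pi, c))

def w_from_risk(pi, c, h=1e-4):
    """w(c) = -L''(c) from (44), approximated by a central second difference."""
    second = (bayes_risk(pi, c + h) - 2 * bayes_risk(pi, c) + bayes_risk(pi, c - h)) / h ** 2
    return -second

def gamma_from_w(pi, c):
    """Inverse direction of Theorem 22, applied to w_from_gamma."""
    return (pi ** 2 * (1 - pi) ** 2 / nu(pi, c) ** 3
            * w_from_gamma(pi, pi * (1 - c) / nu(pi, c)))
```

For this example both routes give $w_\pi(c) = 1/(\pi c (1-c)^2)$, and applying the inverse formula to the result recovers $\gamma$.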



The representation (39,40) allows the determination of weights for common f-divergences.
KL$(P, Q)$ corresponds to $\gamma(\pi) = \frac{1}{\pi^2(1-\pi)}$. Thus $J(P, Q) = \mathrm{KL}(P, Q) + \mathrm{KL}(Q, P)$
corresponds to $\gamma(\pi) = \frac{1}{\pi^2(1-\pi)^2}$. Several f-divergences are presented with their corresponding
weight function in Table 1. The weight for KL$(P, Q)$ has a double pole at $\pi = 0$ which
is why KL-divergence is hard to estimate.26

5.5 Example — Squared Loss 

We illustrate some of the above concepts with a simple example. Consider squared loss.
We have

$$L(\eta, \hat\eta) = \hat\eta^2(1 - \eta) + (\hat\eta - 1)^2\eta$$

and thus $\underline{L}(\eta) = L(\eta, \eta) = \eta(1 - \eta)$ and $\underline{L}''(\eta) = -2$ and thus by (44) $w(\eta) = 2$. From
(28) we thus have

$$f^\pi(t) = \frac{\pi(1 - \pi)(\pi t + 1 - \pi) - (1 - \pi)\pi t}{\pi t + 1 - \pi}.$$

Choosing $\pi = \frac{1}{2}$ this becomes $f^{1/2}(t) = \frac{1-t}{4(t+1)}$. One can check that $8 \cdot f^{1/2}(t) + t - 1 = \frac{(t-1)^2}{t+1}$
which agrees with the $f$ corresponding to Capacitory Discrimination in Table 1. Scaling
is just a question of normalisation and we have already seen that $I_f$ is insensitive to
affine offsets in $f$. This illustrates the awkwardness of parameterising $I_f$ in terms of $f$:
at first sight $f^{1/2}$ and the $f$ from Table 1 seem different. Using weight functions automatically filters
out the effect of any affine offsets: if the weight functions corresponding to $f_1$ and $f_2$
match, then $I_{f_1} = I_{f_2}$. Finally observe that substituting $\gamma(\pi) = 8$ from the table into
Theorem 22, and dividing by the normalising factor of 8 noted above, we obtain
$w_{1/2}(c) = \frac{1}{8} \cdot \frac{1/4}{(1/2)^3} \cdot 8 = 2$, consistent with the weight obtained above.

26. Considering KL-divergence from the weight function perspective immediately suggests a scheme to
estimate it: avoid attempting to estimate the regions near zero and one where the weight function
diverges. A particular example of this is the divergence we have called KL$_\epsilon$ in Table 1. This approach
to regularizing the KL-divergence was suggested by Gutenbrunner (1990, page 454).

6. Graphical Representations 

The last section described representations of risks and f-divergences in terms of weighted
integrals of primitive functions. The weight functions and values of the primitive
functions lend themselves to a graphical interpretation that is explored in this section. In
particular, a diagram called a risk curve is introduced. This is shown to be closely re- 
lated to the cost curves of Drummond and Holte (2006) as well as an idealised receiver 
operating characteristic, or ROC curve (Fawcett, 2004). Risk curves are useful aids to 
intuition when reasoning about risks, divergences and information and they are used 
extensively in Section 7 to derive bounds between various divergences and risks. 

6.1 ROC Curves 

Plotting a receiver operating characteristic curve, or ROC curve, is a way of graphically summarising
the performance of a test statistic. Recall from Section 3.1 that in the context of a binary
experiment $(P, Q)$ on a space $X$, a test statistic $\tau$ is any function that maps points in $X$
to the real line. Each choice of threshold $\tau_0 \in \mathbb{R}$ results in a classifier $r(x) = [\![\tau(x) \ge \tau_0]\!]$
and its corresponding classification rates. An ROC curve for the test statistic $\tau$ is simply
a plot of the true positive rate of these classifiers as a function of their false positive rate
as the threshold $\tau_0$ varies over $\mathbb{R}$. Formally,

$$\mathrm{ROC}(\tau) := \{(\mathrm{FP}_\tau(\tau_0), \mathrm{TP}_\tau(\tau_0)) : \tau_0 \in \mathbb{R}\} \subset [0, 1]^2.$$

A graphical example of an ROC curve is shown as the solid black line in Figure 2.

For a fixed experiment $(P, Q)$, the Neyman-Pearson lemma provides an upper
envelope for ROC curves. It guarantees that the ROC curve for the likelihood ratio
$\tau^* = dP/dQ$ will lie above, or dominate, that of any other test statistic $\tau$ as shown
in Figure 2. This is an immediate consequence of the likelihood ratio being the
uniformly most powerful test since for each false positive rate (or size) $\alpha$ it will have the
largest true positive rate (or power) $\beta$ of all tests (Eguchi and Copas, 2001).

The performance of a test statistic $\tau$ shown in an ROC curve is commonly summarised
by the Area Under the ROC Curve, AUC$(\tau)$, and is closely related to the Mann-Whitney-
Wilcoxon statistic. Formally, if $(P, Q)$ is a binary experiment and $\tau$ a test statistic the
AUC is

$$\mathrm{AUC}(\tau) := \int_0^1 \beta_\tau(\alpha)\,d\alpha \qquad (63)$$
$$= \int_{-\infty}^{\infty} \mathrm{TP}_\tau(\tau_0)\,\mathrm{FP}'_\tau(\tau_0)\,d\tau_0, \qquad (64)$$

where $\beta_\tau(\alpha) = \mathrm{TP}_\tau(\tau_0)$ for a $\tau_0 \in \mathbb{R}$ such that $\mathrm{FP}_\tau(\tau_0) = \alpha$.

Figure 2: Example of an ROC diagram showing an ROC curve for an arbitrary statistical
test $\tau$ (middle, bold curve) as well as an optimal statistical test $\tau^*$ (top,
grey curve). The dashed line represents the ROC curve for a random, or
uninformative statistical test. The horizontal axis is the False Positive Rate (FP).

In Section 3.1 the Neyman-Pearson lemma was used to argue that the curve $\beta(\alpha)$ for
the likelihood ratio dominates all other curves. As the likelihood ratio is used to define
f-divergences, it is natural to ask whether the area under the maximal ROC curve is an
f-divergence. That is, does there exist a convex $f$ such that $I_f(P, Q) = \mathrm{AUC}(dP/dQ)$?
Interestingly, the answer is "no". To see this, note that an f-divergence's integral can
be decomposed as follows

$$I_f(P, Q) = \int_0^\infty f(t) \int_{X_t} dQ\,dt \qquad (65)$$

where $X_t := \{x \in X : dP/dQ(x) = t\} = (dP/dQ)^{-1}(t)$. Compare this to the definition
of AUC given in (64) when $\tau = dP/dQ$

$$\int_{-\infty}^{\infty} \mathrm{TP}_\tau(t)\,\mathrm{FP}'_\tau(t)\,dt = -\int_0^\infty (P \circ \tau^{-1})([t, \infty)) \int_{X_t} dQ\,dt \qquad (66)$$

since $\mathrm{FP}'_\tau(t) = \frac{d}{dt}\int_t^\infty \int_{X_s} dQ(x)\,ds = -\int_{X_t} dQ$ and $dP/dQ \ge 0$. If we assume there
exists an $f$ such that for all binary experiments $(P, Q)$ we have $I_f(P, Q) = \mathrm{AUC}(dP/dQ)$
we would require the integrals in (65) and (66) to be equal for all $(P, Q)$. This would
require $f(t) = -(P \circ (dP/dQ)^{-1})([t, \infty))$ for all $t \in [0, \infty)$ which is not possible for all
binary experiments $(P, Q)$ simultaneously.

Interestingly, even though maximal AUC for $(P, Q)$ cannot be expressed as an
f-divergence, Torgersen (1991) shows how it can be expressed as the variational divergence
between the product measures $P \times Q$ and $Q \times P$. That is, $\mathrm{AUC}(dP/dQ) = V(P \times
Q, Q \times P)$. Following up this connection and considering other f-divergences of
product measures is left as future work.

It is important to realise that AUC is not a particularly intrinsic measure, just a
common one. As the earlier discussion of integral representations has shown, there is
value in considering weighted versions of integrals such as (63). As Hand (2008) notes in
his commentary on a recent paper (outlining another type of performance curve): "To
use all the values of the diagnostic instrument, when integrating to yield the overall AUC
measure, it is necessary to decide what weight to give to each value in the integration.
The AUC implicitly does this using a weighting derived empirically from the data."
Along these lines, Xie and Priebe (2002) and Eguchi and Copas (2001) have suggested
generalisations of the AUC that incorporate weights and show that certain choices of
weight functions yield well-known losses.

A closer investigation of these generalisations of AUC and their connection to mea- 
sures of divergence is also left as future work. 

6.2 Risk Curves 

Risk curves are a graphical representation closely related to ROC curves that take into
account a prior $\pi$ in addition to the binary experiment $(P, Q)$. They provide a concise
summary of the risk of an estimator $\hat\eta$ for the full range of costs $c \in [0, 1]$ for a fixed
prior $\pi \in [0, 1]$, or, alternatively, for the full range of priors $\pi$ given a fixed cost $c$.

Formally, a risk curve for costs for the estimator $\hat\eta$ is the set $\{(c, \mathbb{L}_c(\hat\eta, \pi, P, Q)) : c \in
[0, 1]\}$ of points parametrised by cost.27 A risk curve for priors for the estimator $\hat\eta$ is
the set $\{(\pi, \mathbb{L}_c(\hat\eta, \pi, P, Q)) : \pi \in [0, 1]\}$.

Figure 3 shows an example of a risk curve diagram. On it are plotted the cost curves
for an estimate $\hat\eta$ of a true posterior $\eta$ on the same graph. The "tent" function also
shown is the risk curve for the majority class predictor $\min((1 - \pi)c, (1 - c)\pi)$. Here
$\pi = \frac{1}{2}$. Other choices of $\pi \in (0, 1)$ skew the tent and the curves under it towards 0 or 1.

Figure 3: Example of a risk curve diagram showing risk curves for costs for the true
posterior probability $\eta$ (bottom, solid curve), an estimate $\hat\eta$ (middle, bold
curve) and the majority class or prior estimate (top, dashed curve).

27. Unlike the cost curves originally described by Drummond and Holte (2006), the version presented
here does not normalise the risk and plots the cost on the horizontal axis rather than the product of
the prior probability and cost.

In light of the weighted integral representations described in Theorem 16, several of
the quantities can be associated with properties of a cost curve diagram. The weight
function $w(c)$ associated with a loss $\ell$ can be interpreted as a weighting on the horizontal
axis of a risk curve diagram. When the area under a risk curve is computed with respect
to this weighting the result is the full risk $\mathbb{L}$ since $\mathbb{L}(\eta, \hat\eta) = \int_0^1 \mathbb{L}_c(\eta, \hat\eta)\,w(c)\,dc$.

Furthermore, the weighted area between the risk curves for an estimate $\hat\eta$ and the
true posterior $\eta$ is the regret $\mathbb{L}(\eta, \hat\eta) - \underline{\mathbb{L}}(\eta)$ and the statistical information $\Delta\underline{\mathbb{L}}(\eta, M) =
\underline{\mathbb{L}}(\pi, M) - \underline{\mathbb{L}}(\eta, M)$ is the weighted area between the "tent" risk curve for $\pi$ and the risk
curve for $\eta$.

The correspondence between ROC and risk curves is due to the relationship between
the true class probability $\eta$ and the likelihood ratio $dP/dQ$ for a fixed $\pi$. As shown in
Section 4.1, this relationship is

$$\frac{dP}{dQ} = \frac{1-\pi}{\pi}\,\frac{\eta}{1-\eta} = \lambda_\pi(\eta).$$

Each cost $c \in [0, 1]$ can be mapped to a corresponding test statistic threshold $\tau_0 = \lambda_\pi(c)$
and vice versa.

Drummond and Holte (2006) show that their cost curves have a point-line dual
relationship with ROC curves. The same result holds for our risk diagrams. Specifically,
for a given point (FP, TP) on an ROC diagram the corresponding line in a risk diagram
is

$$L_c = (1 - \pi)\,c\,\mathrm{FP} + \pi\,(1 - c)\,(1 - \mathrm{TP}).$$

Conversely, solving this equation for TP shows that the line in ROC space corresponding
to a point $(c, L_c)$ in risk space is

$$\mathrm{TP} = \frac{(1 - \pi)c}{\pi(1 - c)}\,\mathrm{FP} + \frac{\pi(1 - c) - L_c}{\pi(1 - c)}.$$

An example of this relationship is shown graphically in Figure 4 between the point A
and the line A*.

Figure 4: Cost curve diagram (left) and corresponding ROC diagram (right). The black
curves on the left and right represent risk and classification rates of an example
predictor. The grey Bayes risk curve on the left corresponds to the dominating
grey ROC curve on the right for the likelihood statistic. Similarly, the dashed
tent on the left corresponds to the dashed diagonal ROC line on the right.
The point labelled A in the risk diagram corresponds to the line labelled A*
in the ROC diagram. The horizontal axis of the ROC diagram is the False Positive Rate.
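The point-line duality can be illustrated with a short round trip: map an ROC point to its line in the risk diagram, pick a point on that line, and map it back to a line in ROC space that must pass through the original ROC point. This is a sketch with hypothetical function names; the intercept used in `roc_line` is the one obtained by solving the risk-line equation for TP.

```python
def risk_line(fp, tp, pi):
    """Return the map c -> L_c in the risk diagram for the ROC point (fp, tp)."""
    return lambda c: (1 - pi) * c * fp + pi * (1 - c) * (1 - tp)

def roc_line(c, risk, pi):
    """Return the map FP -> TP in ROC space for the risk-diagram point (c, risk).

    Requires 0 <= c < 1 and pi > 0 so that the denominator is non-zero.
    """
    denom = pi * (1 - c)
    slope = (1 - pi) * c / denom
    intercept = (denom - risk) / denom
    return lambda fp: slope * fp + intercept

# Illustrative values (assumptions, not taken from the figures).
pi, fp, tp, c = 0.4, 0.2, 0.7, 0.3
risk_at_c = risk_line(fp, tp, pi)(c)
tp_recovered = roc_line(c, risk_at_c, pi)(fp)
```

Here `tp_recovered` should equal `tp` up to rounding, confirming that the ROC point lies on the line dual to the risk-diagram point it generated.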

As mentioned earlier, the Neyman-Pearson lemma guarantees the ROC curve for $\eta$ is
maximal. This corresponds to the cost curve being minimal. In fact, these relationships
are dual in the sense that there exists an explicit transformation from one to the other.

6.3 Transforming from ROC to Risk curves and Back 

Recall from Section 3.1 the Neyman-Pearson function $\beta(\alpha, P, Q)$ for the binary
experiment $(P, Q)$. Since the true positive rate for $r$ is $\mathrm{TP}_r = P(r^{-1}(1))$ and the false positive
rate for $r$ is $\mathrm{FP}_r = Q(r^{-1}(1))$ we have

$$\beta(\alpha, P, Q) = \sup_{r \in \{-1,1\}^X} \{P(X_+) : Q(X_+) \le \alpha\}$$

where $X_+ := r^{-1}(1)$.

Noting that the 0-1 loss of $r$ is simply its probability of error (that is, the prior-weighted
average of the false positive and false negative rates) we have for each $\pi \in [0, 1]$ that the Bayes
optimal 0-1 loss is

$$\underline{\mathbb{L}}(\pi, P, Q) = \inf_r \{(1 - \pi)Q(X_+) + \pi(1 - P(X_+))\} \qquad (67)$$

since the false negative rate $\mathrm{FN}_r = P(X \setminus X_+) = 1 - P(X_+)$. Thus for all $\pi, \alpha \in [0, 1]$,
and all measurable functions $r : X \to \{-1, 1\}$,

$$\underline{\mathbb{L}}(\pi, P, Q) \le (1 - \pi)Q(X_+) + \pi(1 - P(X_+))$$
$$\le (1 - \pi)\alpha + \pi(1 - P(X_+))$$
$$\le (1 - \pi)\alpha + \pi(1 - \beta(\alpha, P, Q)).$$

Thus, we see that $\underline{\mathbb{L}}(\pi, P, Q)$ is the largest number $L$ such that $(1 - \pi)\alpha + \pi(1 - \beta(\alpha)) \ge L$
for all $\alpha \in [0, 1]$ and hence one can set

$$\underline{\mathbb{L}}(\pi, P, Q) = L = \min_{\alpha \in [0,1]} \left((1 - \pi)\alpha + \pi(1 - \beta(\alpha))\right) \qquad (68)$$

for each $\pi \in [0, 1]$.

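Conversely, we can express the Neyman-Pearson function $\beta$ in terms of the Bayes 
risk. That is, for any $\alpha \in [0,1]$, $\beta(\alpha, P, Q)$ is the largest number $\beta$ such that 

\[
\begin{aligned}
& \forall \pi \in [0,1] \quad (1-\pi)\alpha + \pi(1-\beta) \ge L(\pi) \\
\Leftrightarrow\ & \forall \pi \in [0,1] \quad (1-\pi)\alpha - L(\pi) \ge \pi(\beta - 1) \\
\Leftrightarrow\ & \forall \pi \in (0,1] \quad \tfrac{1}{\pi}\big((1-\pi)\alpha - L(\pi)\big) \ge \beta - 1 \\
\Leftrightarrow\ & \forall \pi \in (0,1] \quad \beta \le \tfrac{1}{\pi}\big((1-\pi)\alpha + \pi - L(\pi)\big).
\end{aligned}
\]

Thus we can set 

\[ \beta(\alpha) = \inf_{\pi \in (0,1]} \frac{1}{\pi}\big((1-\pi)\alpha + \pi - L(\pi)\big), \quad \alpha \in [0,1]. \tag{69} \]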
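
The two transformations (68) and (69) can be illustrated on a grid. The sketch below is only an approximation under stated assumptions: the ROC curve $\beta(\alpha) = \sqrt{\alpha}$ is a hypothetical example (concave, with $\beta(1) = 1$), the minimum and infimum are taken over finite grids, and the function names are invented for this illustration.

```python
import math


def bayes_risk_from_beta(beta, pi, n_grid=400):
    """Equation (68): L(pi) = min over alpha in [0, 1] of
    (1 - pi) * alpha + pi * (1 - beta(alpha)), evaluated on a grid."""
    alphas = (i / n_grid for i in range(n_grid + 1))
    return min((1 - pi) * a + pi * (1 - beta(a)) for a in alphas)


def beta_from_bayes_risk(risk, alpha, n_grid=200):
    """Equation (69): beta(alpha) = inf over pi in (0, 1] of
    ((1 - pi) * alpha + pi - L(pi)) / pi, evaluated on a grid."""
    pis = (i / n_grid for i in range(1, n_grid + 1))
    return min(((1 - p) * alpha + p - risk(p)) / p for p in pis)


beta = math.sqrt  # hypothetical Neyman-Pearson function, for illustration only


def risk(pi):
    return bayes_risk_from_beta(beta, pi)


print(risk(0.5))                         # close to 0.375
print(beta_from_bayes_risk(risk, 0.25))  # close to sqrt(0.25) = 0.5
```

For this choice of $\beta$ the minimiser in (68) at $\pi = \frac12$ is $\alpha = \frac14$, giving $L(\frac12) = 0.375$, and applying (69) to the resulting risk curve recovers $\beta(\frac14) = \frac12$ up to grid error.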

The expressions (69) and (68) are due to Torgersen (1991). When $\beta(\cdot)$ and $L(\cdot)$ are 
smooth, explicit closed form formulas can be found: 

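Theorem 23 Suppose $\beta$ and $L$ are differentiable on $(0,1]$ and $[0,1]$ respectively. Then 

\[ L(\pi) = (1-\pi)\check{\beta}(\pi) + \pi\big(1 - \beta(\check{\beta}(\pi))\big), \quad \pi \in [0,1], \tag{70} \]

where 

\[ \check{\beta}(\pi) := \beta'^{-1}\!\left(\frac{1-\pi}{\pi}\right) \]

and 

\[ \beta(\alpha) = \frac{1}{\check{L}(\alpha)}\Big[(1-\check{L}(\alpha))\alpha + \check{L}(\alpha) - L(\check{L}(\alpha))\Big], \quad \alpha \in (0,1], \tag{71} \]

where 

\[ \check{L}(\alpha) := \tilde{L}^{-1}(\alpha) \wedge 1, \qquad \tilde{L}(\pi) := L(\pi) - \pi L'(\pi). \]

Proof Consider the right side of (68) and differentiate with respect to $\alpha$: 

\[ \frac{\partial}{\partial\alpha}\Big[(1-\pi)\alpha + \pi(1 - \beta(\alpha))\Big] = (1-\pi) - \pi\beta'(\alpha). \]

Setting this to zero we have $(1-\pi) = \pi\beta'(\alpha)$ and thus $\beta'(\alpha) = \frac{1-\pi}{\pi}$. Since $\beta$ is 
monotonically increasing and concave, $\beta'$ is monotonically decreasing and non-negative. 
Thus we can set 

\[ \alpha = \beta'^{-1}\!\left(\frac{1-\pi}{\pi}\right) = \check{\beta}(\pi). \]

Substituting back into $(1-\pi)\alpha + \pi(1 - \beta(\alpha))$ we obtain (70). 
Now consider the right side of (69): 

\[ \frac{1}{\pi}\big((1-\pi)\alpha + \pi - L(\pi)\big). \tag{72} \]

Differentiating with respect to $\pi$ we have $-\frac{\alpha}{\pi^2} - \frac{L'(\pi)}{\pi} + \frac{L(\pi)}{\pi^2}$. Setting this equal to zero 
we obtain 

\[ -\frac{\alpha}{\pi^2} - \frac{L'(\pi)}{\pi} + \frac{L(\pi)}{\pi^2} = 0, \quad \pi \in (0,1] \]
\[ \Rightarrow\ \alpha + \pi L'(\pi) - L(\pi) = 0. \]

Observing the definition of $\tilde{L}$ we thus have that $\tilde{L}(\pi) = \alpha$. Now 

\[
\begin{aligned}
\tilde{L}'(\pi) &= \frac{\partial}{\partial\pi}\big(-\pi L'(\pi) + L(\pi)\big) \\
&= -\pi L''(\pi) - L'(\pi) + L'(\pi) \\
&= -\pi L''(\pi) \\
&\ge 0
\end{aligned}
\]

since $L$ is concave. Thus $\tilde{L}(\cdot)$ is monotonically non-decreasing and we can write $\pi = 
\tilde{L}^{-1}(\alpha)$. In order to ensure $\pi \in [0,1]$ we substitute $\pi = \check{L}(\alpha)$ into (72) to obtain (71). ■ 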

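Using (71) we present an example. Consider (for $\gamma \in [0,1]$) $L(\pi) = \gamma\pi(1-\pi)$. One 
can readily check that $\tilde{L}(\pi) = \gamma\pi^2$. Hence $\tilde{L}^{-1}(\alpha) = \sqrt{\alpha/\gamma}$. Thus $\check{L}(\alpha) = 
\sqrt{\alpha/\gamma} \wedge 1$. Substituting and rearranging we find that the corresponding 
$\beta$ is given by 

\[ \beta_\gamma(\alpha) = \frac{\alpha + \gamma\big(\sqrt{\alpha/\gamma} \wedge 1\big)^2 + \big(\sqrt{\alpha/\gamma} \wedge 1\big)(1 - \alpha - \gamma)}{\sqrt{\alpha/\gamma} \wedge 1}. \]

A graph of this $\beta_\gamma(\cdot)$ is given in Figure 5. 

Figure 5: Graph of $\beta_\gamma(\alpha, P, Q)$ for $\gamma = i/20$, $i = 1, \ldots, 20$. 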
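
As a sanity check on this example, the closed form obtained from (71) can be compared with a direct grid evaluation of the infimum in (69). This is a minimal sketch, assuming $L(\pi) = \gamma\pi(1-\pi)$ with $\gamma \in (0,1]$ and $\alpha \in (0,1]$; the function names are hypothetical.

```python
import math


def beta_gamma_closed(alpha, gamma):
    """Closed form from (71) for L(pi) = gamma * pi * (1 - pi)."""
    l_check = min(math.sqrt(alpha / gamma), 1.0)
    return (alpha + gamma * l_check ** 2 + l_check * (1 - alpha - gamma)) / l_check


def beta_gamma_numeric(alpha, gamma, n_grid=1000):
    """Grid approximation of the infimum over pi in (0, 1] in (69)."""
    best = math.inf
    for i in range(1, n_grid + 1):
        pi = i / n_grid
        risk = gamma * pi * (1 - pi)
        best = min(best, ((1 - pi) * alpha + pi - risk) / pi)
    return best


print(beta_gamma_closed(0.04, 0.25), beta_gamma_numeric(0.04, 0.25))
print(beta_gamma_closed(0.5, 0.25), beta_gamma_numeric(0.5, 0.25))
```

For $\alpha \le \gamma$ the closed form simplifies to $1 - (\sqrt{\gamma} - \sqrt{\alpha})^2$, and for $\alpha \ge \gamma$ it equals 1, which matches the grid values.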

By construction $\beta(1) = 1$ and $\beta$ is concave and continuous on $(0,1]$. The following 
lemma is due to Torgersen (1991). 

Lemma 24 Suppose $\mathcal{X}$ contains a connected component $C$. Let $\phi\colon [0,1] \to [0,1]$ be an 
arbitrary function that is concave and continuous on $(0,1]$ such that $\phi(1) = 1$. Then 
there exist $P$ and $Q$ such that $\beta(\alpha, P, Q) = \phi(\alpha)$ for all $\alpha \in [0,1]$. 

Proof Let $\mathcal{X}' = [0,1]$ and $Q$ be the uniform distribution on $\mathcal{X}'$. Overload $P$ and $Q$ 
to also denote the respective cumulative distribution functions (i.e. $P(x) = P([0,x])$). 
Thus $Q(\pi) = \pi$. Set $P(\pi) = \phi(\pi)$, so that the constrained quantity $Q(\mathcal{X}^+)$ in the definition of $\beta$ is uniform. Since $\phi(\cdot)$ is increasing it suffices to consider $r(\cdot)$ of 
the form $r_\pi(x) = [x \le \pi]$. Hence 

\[ \beta(\alpha) = \max\{\phi(\pi) : 0 \le \pi \le 1,\ \pi \le \alpha\}, \quad \alpha \in [0,1]. \]

The maximum will always be obtained for $\pi = \alpha$ and thus $\beta(\alpha) = \phi(\alpha)$ for $\alpha \in [0,1]$. 
Finally, a pair of distributions on $\mathcal{X}$ can be constructed by embedding the connected 
component $C \subseteq \mathcal{X}$ into $\mathcal{X}'$. Choose $g\colon C \to \mathcal{X}'$ such that $g$ is invertible. Such a $g$ always 
exists since $C$ is connected. Then $g^{-1}$ induces distributions $P'$ and $Q'$ on $C$ and thus on 
$\mathcal{X}$ by subsethood. ■ 



Corollary 25 Let $\psi\colon [0,1] \to [0,1]$ be an arbitrary concave function such that for all 
$\pi \in [0,1]$, $0 \le \psi(\pi) \le \pi \wedge (1-\pi)$. Then there exist $P$ and $Q$ such that $L(\pi, P, Q) = \psi(\pi)$ 
for all $\pi \in [0,1]$. 

Proof Choose a $\psi$ satisfying the conditions and substitute into (69). This gives a 
corresponding $\phi(\cdot)$. We know from the preceding lemma that there exist $P$ and $Q$ such 
that $\beta(\cdot, P, Q) = \phi(\cdot)$ which corresponds to $L(\cdot, P, Q)$. Thus it remains to show that the 
function $\phi$ defined by 

\[ \phi(\alpha) = \inf_{\pi \in (0,1]} \frac{1}{\pi}\big((1-\pi)\alpha + \pi - \psi(\pi)\big) \]

is concave and satisfies $\phi(1) = 1$. Observe that $\phi(1) = \inf_{\pi \in (0,1]} \frac{1 - \psi(\pi)}{\pi}$. Now by the 
upper bound on $\psi$ we have $\frac{1-\psi(\pi)}{\pi} \ge \frac{1 - (1-\pi)}{\pi} = 1$. But $\lim_{\pi \to 1} \frac{1-\psi(\pi)}{\pi} = 1$ and thus 
$\phi(1) = 1$. Finally note that 

\[ \phi(\alpha) = \inf_{\pi \in (0,1]} \left(\frac{1-\pi}{\pi}\right)\alpha + \left(1 - \frac{\psi(\pi)}{\pi}\right). \]

This is the lower envelope of a parametrized (by $\pi$) family of affine functions (in $\alpha$) and 
is thus concave. ■ 



7. Bounding General Objects in Terms of Primitives 

All of the above results are exact: they are exact representations of particular primitives 
or general objects in terms of other primitives. Another type of relationship is an 
inequality. In this section we consider how we can (tightly) bound the value of a general 
object ($I_f$ or $B_w$) in terms of primitive objects ($V_\pi$, defined below, or $B_c$). Bounding 
$I_f(P,Q)$ in terms of $V_\pi(P,Q)$ is a generalisation of the classical Pinsker inequality. 
Bounding $B_w(\eta, \hat{\eta})$ in terms of $B_c(\eta, \hat{\eta})$ is a generalisation of the so-called "surrogate loss 
bounds." 

As explained previously, we work with the conditional Bregman divergence $B_w(\eta, \hat{\eta})$. 
Results in terms of $B_w(\eta, \hat{\eta})$, $\eta, \hat{\eta} \in [0,1]$, immediately imply results for $\mathbb{B}_w(\eta, \hat{\eta})$, $\eta, \hat{\eta}\colon \mathcal{X} \to 
[0,1]$, by taking expectations with respect to $X$. 

7.1 Surrogate Loss Bounds 

Suppose for some fixed $c_0 \in (0,1)$ that $B_{c_0}(\eta, \hat{\eta}) = \alpha$. What can be said concerning the 
value of $B_w(\eta, \hat{\eta})$ for an arbitrary weight function $w$? This is known as a surrogate loss 
bound. Previous works on this problem are summarised in Appendix C. Apart from 
its theoretical interest, the question has direct practical implications: it can often be 
much simpler to minimise $B_w(\eta, \hat{\eta})$ over $\hat{\eta}$ than to minimise $B_c(\eta, \hat{\eta})$. The bounds below 
tell the user of such a scheme the maximum price they will have to pay in terms of 
statistical performance. 

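Theorem 26 Let $c_0 \in (0,1)$ and suppose it is known that $B_{c_0}(\eta, \hat{\eta}) = \alpha \in (0, c_0)$. Let 
$w$, $W$ and $\overline{W}$ be as in Theorem 18. Then 

\[ B_w(\eta, \hat{\eta}) \ge \big[\overline{W}(c_0 - \alpha) + \alpha W(c_0)\big] \wedge \big[\overline{W}(c_0 + \alpha) - \alpha W(c_0)\big] - \overline{W}(c_0). \tag{73} \]

If $c_0 = \frac12$ and $w$ is symmetric about $\frac12$ ($w(\frac12 + a) = w(\frac12 - a)$ for $a \in (0, \frac12)$) then 

\[ B_w(\eta, \hat{\eta}) \ge \overline{W}(\tfrac12 + \alpha) - \alpha W(\tfrac12) - \overline{W}(\tfrac12) \tag{74} \]
\[ \phantom{B_w(\eta, \hat{\eta})} = \overline{W}(\tfrac12 - \alpha) + \alpha W(\tfrac12) - \overline{W}(\tfrac12). \tag{75} \]

Furthermore (73) and (74) are the best possible. 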

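Proof By hypothesis $B_{c_0}(\eta, \hat{\eta}) = \alpha$ and thus from (50) it must be true that either 
($\eta < c_0$ and $\hat{\eta} \ge c_0$) or ($\eta > c_0$ and $\hat{\eta} \le c_0$). Suppose for now it is the former. We need 
to determine the minimum possible value of $B_w(\eta, \hat{\eta})$. From (48) we thus seek 

\[ \min_{\eta \in [0, c_0],\ \hat{\eta} \in [c_0, 1]} \overline{W}(\eta) - \overline{W}(\hat{\eta}) - (\eta - \hat{\eta}) W(\hat{\eta}). \tag{76} \]

From case 1 of (50) we know $c_0 - \eta = \alpha$ and hence $\eta = c_0 - \alpha$ and the problem is reduced 
to determining 

\[ \min_{\hat{\eta} \in [c_0, 1]} \overline{W}(c_0 - \alpha) - \overline{W}(\hat{\eta}) - (c_0 - \alpha - \hat{\eta}) W(\hat{\eta}). \]

Differentiating the above expression with respect to $\hat{\eta}$ we obtain 

\[ \frac{\partial}{\partial \hat{\eta}}\Big[\overline{W}(c_0 - \alpha) - \overline{W}(\hat{\eta}) - (c_0 - \alpha - \hat{\eta}) W(\hat{\eta})\Big] = -(c_0 - \alpha - \hat{\eta})\, w(\hat{\eta}) =: \gamma. \tag{77} \]

By assumption (for now) we have $\hat{\eta} \ge c_0$ and thus 

\[ c_0 - \alpha - \hat{\eta} \in [c_0 - \alpha - 1,\ c_0 - \alpha - c_0] = [c_0 - 1 - \alpha,\ -\alpha], \tag{78} \]

where both endpoints are negative. 
Equations 77 and 78 together imply that $\gamma \ge 0$. Clearly $\gamma$ can only equal zero if $w(\hat{\eta}) = 0$ 
for some $\hat{\eta} \in [c_0, 1]$. Since the derivative is consequently everywhere non-negative, the 
minimum occurs at the minimum value of $\hat{\eta}$; that is at $\hat{\eta} = c_0$. Substituting this value 
of $\hat{\eta}$ into (48) we obtain 

\[ B_w(\eta, \hat{\eta}) \ge \overline{W}(c_0 - \alpha) - \overline{W}(c_0) + \alpha W(c_0). \tag{79} \]

If instead we have $\eta > c_0$ and $\hat{\eta} \le c_0$, we have (from case 2 of (50)) that $\alpha = \eta - c_0$ 
and thus $\eta = \alpha + c_0$ and we need to determine 

\[ \min_{\hat{\eta} \in [0, c_0]} \overline{W}(\alpha + c_0) - \overline{W}(\hat{\eta}) - (\alpha + c_0 - \hat{\eta}) W(\hat{\eta}). \]

Again differentiating with respect to $\hat{\eta}$ we obtain 

\[ \frac{\partial}{\partial \hat{\eta}}\Big[\overline{W}(\alpha + c_0) - \overline{W}(\hat{\eta}) - (\alpha + c_0 - \hat{\eta}) W(\hat{\eta})\Big] = -(\alpha + c_0 - \hat{\eta})\, w(\hat{\eta}) =: \gamma. \]

Furthermore we have $\hat{\eta} \in [0, c_0]$ and so $(\alpha + c_0 - \hat{\eta}) \in [\alpha + c_0 - c_0,\ \alpha + c_0]$ and thus $\gamma \le 0$ 
and can only equal zero if $w(\hat{\eta}) = 0$. Since the derivative is consequently everywhere 
non-positive, the minimum occurs at the maximum possible value of $\hat{\eta}$, namely $\hat{\eta} = c_0$. 
Substituting this value of $\hat{\eta}$ into (48) we obtain 

\[ B_w(\eta, \hat{\eta}) \ge \overline{W}(c_0 + \alpha) - \overline{W}(c_0) - \alpha W(c_0). \tag{80} \]

Combining (79) and (80) gives (73). 

If $c_0 = \frac12$ and $w$ is symmetric about $\frac12$ then for $\alpha \in [0, \frac12]$ and $a \in [0, \alpha]$ we have 

\[
\begin{aligned}
& w(\tfrac12 - a) = w(\tfrac12 + a) \\
\Rightarrow\ & \int_0^a w(\tfrac12 - s)\,ds = \int_0^a w(\tfrac12 + s)\,ds \\
\Rightarrow\ & W(\tfrac12) - W(\tfrac12 - a) = W(\tfrac12 + a) - W(\tfrac12) \\
\Rightarrow\ & \int_0^\alpha W(\tfrac12) - W(\tfrac12 - a)\,da = \int_0^\alpha W(\tfrac12 + a) - W(\tfrac12)\,da \\
\Rightarrow\ & \overline{W}(\tfrac12 - \alpha) + \alpha W(\tfrac12) = \overline{W}(\tfrac12 + \alpha) - \alpha W(\tfrac12),
\end{aligned}
\]

in which case (73) reduces to (74). 

We finally demonstrate the tightness of the bound. Since $\eta = \pi \frac{dP}{dM}$, by choosing an 
arbitrary $\pi \in (0,1)$ and $M$ uniform on $\mathcal{X}$, 
given any desired $\eta\colon \mathcal{X} \to [0,1]$ there exist $P$ and $Q$ that generate $\eta(\cdot)$. Furthermore 
$\hat{\eta}(\cdot)$ can be an arbitrary function on $\mathcal{X}$. Thus one can take $(\eta, \hat{\eta})$ to be induced by the 
arg min in (76). By the above construction there exist $\eta(\cdot)$ and $\hat{\eta}(\cdot)$ such that the 
constructed $(\eta, \hat{\eta})$ are the corresponding values conditioned on $x \in \mathcal{X}$. Thus there exists 
$(\pi, P, Q)$ such that (73) is attained and thus the bound is tight. ■ 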
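
The bound (73) can be illustrated for a concrete weight function. The sketch below assumes the log-loss weight $w(c) = 1/(c(1-c))$, for which one valid choice of antiderivatives is $W(c) = \ln\frac{c}{1-c}$ and $\overline{W}(c) = c\ln c + (1-c)\ln(1-c)$, so that the point-wise regret is the binary KL divergence. These choices and all function names are assumptions made for illustration; they are not taken from Theorem 18. The brute-force search mirrors the two cases in the proof.

```python
import math


def W(c):
    """Antiderivative of the assumed log-loss weight w(c) = 1 / (c (1 - c))."""
    return math.log(c / (1 - c))


def W_bar(c):
    """Antiderivative of W (negative binary entropy)."""
    return c * math.log(c) + (1 - c) * math.log(1 - c)


def regret(eta, eta_hat):
    """Conditional Bregman divergence in the form used in (76)."""
    return W_bar(eta) - W_bar(eta_hat) - (eta - eta_hat) * W(eta_hat)


def surrogate_bound(c0, alpha):
    """Right-hand side of (73); requires 0 < c0 - alpha and c0 + alpha < 1."""
    left = W_bar(c0 - alpha) + alpha * W(c0)
    right = W_bar(c0 + alpha) - alpha * W(c0)
    return min(left, right) - W_bar(c0)


def brute_force_minimum(c0, alpha, n_grid=1000):
    """Smallest regret over a grid of eta_hat values in the two cases of the proof."""
    grid = [i / n_grid for i in range(1, n_grid)]
    case1 = min(regret(c0 - alpha, e) for e in grid if e >= c0)
    case2 = min(regret(c0 + alpha, e) for e in grid if e <= c0)
    return min(case1, case2)


print(surrogate_bound(0.5, 0.25), brute_force_minimum(0.5, 0.25))
```

In both cases the minimum over the grid is attained at $\hat{\eta} = c_0$, as the derivative argument in the proof predicts, so the two printed values agree.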

So far we have glossed over the constants of integration implicit in defining $W$ and $\overline{W}$ in 
terms of $w$. Replacing $W(c)$ by $W(c) + k_1$ and $\overline{W}(c)$ by $\overline{W}(c) + c k_1 + k_2$ and substituting 
into (e.g.) (74) we obtain 

\[ \overline{W}(\tfrac12 + \alpha) + k_1(\tfrac12 + \alpha) + k_2 - \alpha W(\tfrac12) - \alpha k_1 - \overline{W}(\tfrac12) - \tfrac{k_1}{2} - k_2 \tag{81} \]
\[ = \overline{W}(\tfrac12 + \alpha) - \alpha W(\tfrac12) - \overline{W}(\tfrac12) + \tfrac{k_1}{2} - \tfrac{k_1}{2} + k_1\alpha - k_1\alpha + k_2 - k_2. \tag{82} \]

Thus the choice of the constants of integration does not affect the bound. 
One can take a Taylor series expansion of (74) in $\alpha$ about zero to obtain 

\[ B_w(\eta, \hat{\eta}) \ge \frac{w(\frac12)}{2}\alpha^2 + \frac{w''(\frac12)}{24}\alpha^4 + \cdots \]

There is no third order term since for (74) to hold $w$ is symmetric and thus $w'(\frac12) = 0$. 
This corresponds to the second order result presented in Buja et al. (2005). 

7.2 General Pinsker Inequalities for Divergences 

The many different $f$-divergences are single number summaries of the relationship between 
two distributions $P$ and $Q$. Each $f$-divergence emphasises different aspects. 
Merely considering the functions $f$ by which $f$-divergences are traditionally defined 
makes it hard to understand these different aspects, and harder still to understand how 
knowledge of $I_{f_1}$ constrains the possible values of $I_{f_2}$. When $I_{f_1} = V$ (a special primitive 
for $I_f$) and $I_{f_2} = \mathrm{KL}$, this is a classical problem that has been studied for decades. 

Vajda (1970) posed the question of a tight lower bound on KL-divergence in terms 
of variational divergence. This "best possible Pinsker inequality" takes the form 

\[ L(V) := \inf_{V(P,Q) = V} \mathrm{KL}(P,Q), \quad V \in [0,2). \tag{83} \]

Recently Fedotov et al. (2003) presented an implicit (parametric) version 
of the form 

\[ (V(t), L(t))_{t \in \mathbb{R}^+}, \tag{84} \]
\[ V(t) = t\left[1 - \left(\coth(t) - \frac{1}{t}\right)^2\right], \qquad L(t) = \log\left(\frac{t}{\sinh(t)}\right) + t\coth(t) - \frac{t^2}{\sinh^2(t)}. \]

We will now show how viewing $f$-divergences in terms of their weighted integral representation 
simplifies the problem of understanding the relationship between different 
divergences and leads, amongst other things, to an explicit formula for (83). 

Fix a positive integer $n$. Consider a sequence $0 < \pi_1 < \pi_2 < \cdots < \pi_n < 1$. Suppose 
we "sampled" the value of $\Delta L(\pi, P, Q)$ at these discrete values of $\pi$. This is equivalent to 
knowing the values of the "narrowband" (so called because its weight is $\gamma(\pi) = 4\delta(\pi - \pi_i)$) 
primitive generalised variational divergence $V_{\pi_i}(P,Q) := V_{[-1,1]^{\mathcal{X}}, \pi_i}(P,Q)$. 

Since $\pi \mapsto L(\pi, P, Q)$ is concave, the piecewise linear concave function passing 
through points $\{(\pi_i, L^{0-1}(\pi_i, P, Q))\}_{i=1}^{n}$ is guaranteed to be an upper bound on the 
Bayes risk curve $(\pi, L^{0-1}(\pi, P, Q))_{\pi \in (0,1)}$. This therefore gives a lower bound on the 
statistical information for a task with loss given by a weight function $\gamma$ and therefore a 
lower bound on the $f$-divergence $I_f(P,Q)$ corresponding to the statistical information. 
This observation forms the basis of the theorem stated below. 

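Theorem 27 For a positive integer $n$ consider a sequence $0 < \pi_1 < \pi_2 < \cdots < \pi_n < 1$. 
Let $\pi_0 := 0$ and $\pi_{n+1} := 1$. Let 

\[ \psi_i := L^{0-1}(\pi_i, P, Q), \quad i = 0, \ldots, n+1 \]

(observe that consequently $\psi_0 = \psi_{n+1} = 0$). Let 

\[ A_n := \left\{ a = (a_1, \ldots, a_n) \in \mathbb{R}^n : \frac{\psi_{i+1} - \psi_i}{\pi_{i+1} - \pi_i} \le a_i \le \frac{\psi_i - \psi_{i-1}}{\pi_i - \pi_{i-1}},\ i = 1, \ldots, n \right\}. \tag{85} \]

The set $A_n$ defines the allowable slopes of a piecewise linear function majorizing $\pi \mapsto 
L^{0-1}(\pi, P, Q)$ at each of $\pi_1, \ldots, \pi_n$. For $a = (a_1, \ldots, a_n) \in A_n$, let 

\[ \tilde{\pi}_i := \frac{\psi_i - \psi_{i+1} + a_{i+1}\pi_{i+1} - a_i\pi_i}{a_{i+1} - a_i}, \quad i = 0, \ldots, n, \tag{86} \]
\[ j := \{k \in \{1, \ldots, n\} : \tilde{\pi}_k < \tfrac12 \le \tilde{\pi}_{k+1}\}, \tag{87} \]
\[ \bar{\pi}_i := [i < j]\,\tilde{\pi}_i + [i = j]\,\tfrac12 + [j < i]\,\tilde{\pi}_{i-1}, \tag{88} \]
\[ \alpha_{a,i} := [i \le j](1 - a_i) + [i > j](-1 - a_{i-1}), \tag{89} \]
\[ \beta_{a,i} := [i \le j](\psi_i - a_i\pi_i) + [i > j](\psi_{i-1} - a_{i-1}\pi_{i-1}) \tag{90} \]

for $i = 0, \ldots, n+1$ and let $\gamma_f$ be the weight corresponding to $f$ given by (40). 

For arbitrary $I_f$ and for all distributions $P$ and $Q$ the following bound holds. If in 
addition $\mathcal{X}$ contains a connected component, it is tight. 

\[ I_f(P,Q) \ge \min_{a \in A_n} \sum_{i=0}^{n} \int_{\bar{\pi}_i}^{\bar{\pi}_{i+1}} (\alpha_{a,i}\pi + \beta_{a,i})\, \gamma_f(\pi)\, d\pi \tag{91} \]
\[ = \min_{a \in A_n} \sum_{i=0}^{n} \Big[ (\alpha_{a,i}\bar{\pi}_{i+1} + \beta_{a,i})\, \Gamma_f(\bar{\pi}_{i+1}) - \alpha_{a,i}\, \overline{\Gamma}_f(\bar{\pi}_{i+1}) - (\alpha_{a,i}\bar{\pi}_i + \beta_{a,i})\, \Gamma_f(\bar{\pi}_i) + \alpha_{a,i}\, \overline{\Gamma}_f(\bar{\pi}_i) \Big], \tag{92} \]

where $\Gamma_f(\pi) := \int^{\pi} \gamma_f(t)\,dt$ and $\overline{\Gamma}_f(\pi) := \int^{\pi} \Gamma_f(t)\,dt$. 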

Equation 92 follows from (91) by integration by parts. The remainder of the proof is in 
Appendix A.5. Although (91) looks daunting, we observe: (1) the constraints on $a$ are 
convex (in fact they are a box constraint); and (2) the objective is a relatively benign 
function of $a$. By Theorem 15 and the fact that $2\ell_{1/2} = \ell^{0-1}$, $\psi_i = L^{0-1}(\pi_i, P, Q) = 
2L_{\pi_i}(\tfrac12, P, Q)$. 

When $n = 1$ the result simplifies considerably. If in addition $\pi_1 = \frac12$ then $V_{1/2}(P,Q) = 
V(P,Q) = 2 - 4L^{0-1}(\tfrac12, P, Q)$. It is then a straightforward exercise to explicitly evaluate 
(91), especially when $\gamma_f$ is symmetric. The following theorem expresses the result in 
terms of $V(P,Q)$ for comparability with previous results. The result for $\mathrm{KL}(P,Q)$ is a 
(best-possible) improvement on the classical Pinsker inequality. The various divergences 
are defined in Table 1. 

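Theorem 28 For any distributions $P, Q$ on $\mathcal{X}$, let $V := V(P,Q)$. Then the following 
bounds hold and, if in addition $\mathcal{X}$ has a connected component, are tight. 
When $\gamma_f$ is symmetric about $\frac12$ and convex, 

\[ I_f(P,Q) \ge 2\left[\overline{\Gamma}_f\!\left(\tfrac12 - \tfrac{V}{4}\right) + \tfrac{V}{4}\Gamma_f\!\left(\tfrac12\right) - \overline{\Gamma}_f\!\left(\tfrac12\right)\right] \tag{93} \]

and $\Gamma_f$ and $\overline{\Gamma}_f$ are as in Theorem 27. The following special cases hold. 

\[ h^2(P,Q) \ge 2 - \sqrt{4 - V^2}; \qquad J(P,Q) \ge 2V \ln\left(\frac{2+V}{2-V}\right); \qquad \Psi(P,Q) \ge \frac{8V^2}{4 - V^2}; \]
\[ I(P,Q) \ge \left(\tfrac12 - \tfrac{V}{4}\right)\ln(2 - V) + \left(\tfrac12 + \tfrac{V}{4}\right)\ln(2 + V) - \ln(2); \]
\[ T(P,Q) \ge \ln\left(\frac{4}{\sqrt{4 - V^2}}\right) - \ln(2). \]

When $\gamma_f$ is not symmetric, the following special cases hold: 

\[ \chi^2(P,Q) \ge [V < 1]\,V^2 + [V \ge 1]\,\frac{V}{2 - V}; \tag{94} \]
\[ \mathrm{KL}(P,Q) \ge \min_{\beta \in [V-2,\ 2-V]} \left(\frac{V + 2 - \beta}{4}\right)\ln\left(\frac{\beta - 2 - V}{\beta - 2 + V}\right) + \left(\frac{\beta + 2 - V}{4}\right)\ln\left(\frac{\beta + 2 - V}{\beta + 2 + V}\right). \tag{95} \]

This theorem gives the first explicit representation of the optimal Pinsker bound.$^{28}$ By 
plotting both (84) and (95) one can confirm that the two bounds (implicit and explicit) 
coincide; see Figure 6. Equation 93 should be compared with (74). 

Figure 6: Lower bound on $\mathrm{KL}(P,Q)$ as a function of the variational divergence $V(P,Q)$. 
Both the explicit bound (95) and Fedotov et al.'s implicit bound (84) are 
plotted. 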
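
The agreement between the implicit bound (84) and the explicit bound (95) can also be checked numerically rather than graphically. The following is a minimal sketch, assuming the parametric form $V(t) = t[1 - (\coth t - 1/t)^2]$, $L(t) = \log(t/\sinh t) + t\coth t - t^2/\sinh^2 t$ for (84) and a simple grid search over $\beta$ for the minimisation in (95); the function names and the grid size are choices made for this illustration.

```python
import math


def fedotov_point(t):
    """Point (V(t), L(t)) on the implicit curve (84), for t > 0."""
    coth = math.cosh(t) / math.sinh(t)
    v = t * (1 - (coth - 1 / t) ** 2)
    l = math.log(t / math.sinh(t)) + t * coth - t ** 2 / math.sinh(t) ** 2
    return v, l


def explicit_kl_bound(v, n_grid=4000):
    """Grid approximation of the minimum over beta in (95), for 0 < v < 2.
    The endpoints of [v - 2, 2 - v] are excluded to avoid log(0) and division by zero."""
    lo, hi = v - 2, 2 - v
    best = math.inf
    for i in range(1, n_grid):
        b = lo + (hi - lo) * i / n_grid
        value = ((v + 2 - b) / 4) * math.log((b - 2 - v) / (b - 2 + v)) \
            + ((b + 2 - v) / 4) * math.log((b + 2 - v) / (b + 2 + v))
        best = min(best, value)
    return best


v, l = fedotov_point(1.0)
print(v, l, explicit_kl_bound(v))
```

Each value of $\beta$ corresponds to a pair of two-point distributions with variational divergence $V$, so the grid search is a search over binary experiments; the classical Pinsker bound $V^2/2$ lies strictly below both values.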

The above theorem suggests a means by which one can estimate an $f$-divergence by 
estimating a sequence $(L_{c_i}(\pi, P, Q))_{i=1}^{n}$. A simpler version of such an idea (more directly 
using the representation (39)) has been studied by Song et al. (2008). 

28. A summary of existing results and their relationship to those presented here is given in Appendix D. 

8. Variational Representations 

We have already seen a number of connections between the Bayes risk 

\[ L(\pi, P, Q) = \inf_{\hat{\eta} \in [0,1]^{\mathcal{X}}} \mathbb{E}_{X \sim M}\big[L(\eta(X), \hat{\eta}(X))\big] \tag{96} \]

and the $f$-divergence 

\[ I_f(P,Q) = \mathbb{E}_Q\left[f\left(\frac{dP}{dQ}\right)\right]. \tag{97} \]



Comparing these definitions leads to an obvious and intriguing point: the definition 
of $L$ involves an optimisation, whereas that for $I_f$ does not. Observe that in normal 
usage one wishes not just to know the real number 
$L(\pi, P, Q)$, but also to obtain the estimate $\hat{\eta}\colon \mathcal{X} \to [0,1]$ that attains the minimal risk. 
In this section we will explore two views of $I_f$, relating the standard definition to a 
variational one that explains where the optimisation is hidden in (97). The easiest place 
to start, unsurprisingly, is with the variational divergence. Below we derive a straightforward 
extension of the classical result relating $L^{0-1}(\tfrac12, P, Q)$ to $V(P,Q)$. We then 
explore variational representations for general $f$-divergences and consequently develop 
some new generalisations. 



8.1 Generalised Variational Divergence 

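Let $\mathcal{C} \subseteq \{-1,1\}^{\mathcal{X}}$ denote a collection of binary classifiers on $\mathcal{X}$. Consider the (constrained$^{29}$) 
Bayes risk for 0-1 loss minimised over this set 

\[ L^{0-1}_{\mathcal{C}}(\pi, P, Q) = \inf_{r \in \mathcal{C}} \mathbb{E}_{(X,Y) \sim \mathbb{P}}\big[\ell^{0-1}(r(X), Y)\big]. \tag{98} \]

The variational divergence is so called because it can be written 

\[ V(P,Q) = 2 \sup_{A \subseteq \mathcal{X}} |P(A) - Q(A)|, \tag{99} \]

where the supremum is over all measurable subsets of $\mathcal{X}$. Since $V(P,Q) = \sup_{r \in [-1,1]^{\mathcal{X}}} |\mathbb{E}_P r - 
\mathbb{E}_Q r|$, consider the following generalisation of $V$: 

\[ V_{\mathcal{R},\pi}(P,Q) := 2 \sup_{r \in \mathcal{R}} |\pi\mathbb{E}_P r - (1-\pi)\mathbb{E}_Q r|, \tag{100} \]

where $\pi \in (0,1)$. When $\pi = \frac12$ this is a scaled version of what Müller (1997a,b) calls an 
integral probability metric.$^{30}$ 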

If $\mathcal{R}$ is symmetric about zero ($r \in \mathcal{R} \Rightarrow -r \in \mathcal{R}$), then the absolute value signs in 
(100) can be removed. To see this, suppose the supremum was attained at $r$ and that 
$\alpha := \pi\mathbb{E}_P r - (1-\pi)\mathbb{E}_Q r < 0$. Choose $r' := -r$ and observe that $\pi\mathbb{E}_P r' - (1-\pi)\mathbb{E}_Q r' = 
-\alpha > 0$. Thus $V_{\mathcal{R},\pi}(P,Q) = 2\sup_{r \in \mathcal{R} \subseteq [-1,1]^{\mathcal{X}}} (\pi\mathbb{E}_P r - (1-\pi)\mathbb{E}_Q r)$. 

29. Tong and Koller (2000) call this the restricted Bayes risk. 

30. Zolotarev (1984) calls this a probability metric with $\zeta$-structure. There are probability metrics that 
are neither $f$-divergences nor integral probability metrics. A large collection is due to Rachev (1991). 
A recent survey on relationships (inequalities and some representations) has been given by Gibbs 
and Su (2002). 

Let $\operatorname{sgn}\mathcal{R} := \{\operatorname{sgn} r : r \in \mathcal{R}\}$ and for $a, b \in \mathbb{R}$, let $a\mathcal{R} + b := \{ar + b : r \in \mathcal{R}\}$. 

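Theorem 29 Suppose $\mathcal{R} \subseteq [-1,1]^{\mathcal{X}}$ is symmetric about zero and $\operatorname{sgn}\mathcal{R} \subseteq \mathcal{R}$. For all 
$\pi \in (0,1)$ and all $P$ and $Q$ 

\[ L^{0-1}_{(\operatorname{sgn}\mathcal{R} + 1)/2}(\pi, P, Q) = \tfrac12 - \tfrac14 V_{\mathcal{R},\pi}(P,Q) \tag{101} \]

and the infimum in (98) corresponds to the supremum in (100). 

Proof Let $\mathcal{C} := (\operatorname{sgn}\mathcal{R} + 1)/2 \subseteq \{0,1\}^{\mathcal{X}}$ and so $\operatorname{sgn}\mathcal{R} = 2\mathcal{C} - 1$. Then 

\[
\begin{aligned}
L^{0-1}_{\mathcal{C}}(\pi, P, Q) &= \inf_{r \in \mathcal{C}} \mathbb{E}_{(X,Y) \sim \mathbb{P}}\, \ell^{0-1}(r(X), Y) \\
&= \inf_{r \in \mathcal{C}} \big(\pi\mathbb{E}_{X \sim P}\, \ell^{0-1}(r(X), 0) + (1-\pi)\mathbb{E}_{X \sim Q}\, \ell^{0-1}(r(X), 1)\big) \\
&= \inf_{r \in \mathcal{C}} \big(\pi\mathbb{E}_{X \sim P}[r(X) = 1] + (1-\pi)\mathbb{E}_{X \sim Q}[r(X) = 0]\big) \\
&= \inf_{r \in \mathcal{C}} \big(\pi\mathbb{E}_P r + (1-\pi)\mathbb{E}_Q(1 - r)\big)
\end{aligned}
\]

since $\operatorname{Ran} r = \{0,1\}$ implies $\mathbb{E}_{X \sim P}[r(X) = 1] = \mathbb{E}_{X \sim P}\, r(X)$ and $\mathbb{E}_{X \sim Q}[r(X) = 0] = \mathbb{E}_{X \sim Q}(1 - 
r(X))$. Let $\rho = 2r - 1 \in 2\mathcal{C} - 1$. Thus $r = \frac{\rho + 1}{2}$. Hence 

\[
\begin{aligned}
L^{0-1}_{\mathcal{C}}(\pi, P, Q) &= \inf_{\rho \in 2\mathcal{C} - 1} \left(\tfrac{\pi}{2}\mathbb{E}_P(\rho + 1) + \tfrac{1-\pi}{2}\mathbb{E}_Q(1 - \rho)\right) \\
&= \tfrac12 \inf_{\rho \in 2\mathcal{C} - 1} \big(\pi\mathbb{E}_P\rho - (1-\pi)\mathbb{E}_Q\rho + \pi + (1-\pi)\big) \\
&= \tfrac12 - \tfrac12 \sup_{\rho \in 2\mathcal{C} - 1} \big(\pi\mathbb{E}_P(-\rho) - (1-\pi)\mathbb{E}_Q(-\rho)\big).
\end{aligned}
\]

Since $\mathcal{R}$ is symmetric about zero, $\operatorname{sgn}(\mathcal{R}) = 2\mathcal{C} - 1$ is symmetric about zero, and $\mathcal{C} \subseteq \{0,1\}^{\mathcal{X}}$ is symmetric about $\frac12$; 
i.e. $\rho \in \mathcal{C} \Rightarrow (1 - \rho) \in \mathcal{C}$. Thus 

\[
\begin{aligned}
L^{0-1}_{\mathcal{C}}(\pi, P, Q) &= \tfrac12 - \tfrac12 \sup_{\rho \in 2\mathcal{C} - 1} \big(\pi\mathbb{E}_P\rho - (1-\pi)\mathbb{E}_Q\rho\big) \\
&= \tfrac12 - \tfrac14 V_{(2\mathcal{C} - 1),\pi}(P,Q) \\
&= \tfrac12 - \tfrac14 V_{\operatorname{sgn}\mathcal{R},\pi}(P,Q). 
\end{aligned}
\tag{102}
\]

Since by assumption $\operatorname{sgn}\mathcal{R} \subseteq \mathcal{R}$, the supremum in (100) will be $\pm 1$-valued everywhere. 
Thus $V_{\operatorname{sgn}\mathcal{R},\pi}(P,Q) = V_{\mathcal{R},\pi}(P,Q)$. Combining this fact with (102) leads to (101). 

Finally observe that by replacing inf and sup by arg min and arg max the final part 
of the theorem is apparent. ■ 

8.1.1 The Linear "Loss" 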

This theorem shows that computing $V_{\mathcal{R},\pi}$ involves an optimisation problem equivalent 
to that arising in the determination of $L$. The arg min in the definition of $L$ is usually 
called the hypothesis (or Bayes optimal hypothesis). Following Borgwardt et al. (2006) 
we will call the arg max in (100) the witness. 

When $\mathcal{R} = [-1,1]^{\mathcal{X}}$ and $\pi = \frac12$, $\operatorname{sgn}\mathcal{R} \subseteq \mathcal{R}$ and furthermore $\mathcal{C} = (\operatorname{sgn}\mathcal{R} + 1)/2 = 
\{0,1\}^{\mathcal{X}}$ and so Theorem 29 reduces to the classical result that $L^{0-1}(\tfrac12, P, Q) = \tfrac12 - 
\tfrac14 V(P,Q)$ (Devroye et al., 1996). 
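
On a finite sample space the identity in Theorem 29 can be verified by brute force. The sketch below assumes the unrestricted class $\mathcal{R} = [-1,1]^{\mathcal{X}}$, for which the supremum in (100) is attained at a sign vector, and follows the labelling used in the proof above (a prediction of 1 on a draw from $P$ and a prediction of 0 on a draw from $Q$ count as errors). The distributions and function names are hypothetical.

```python
from itertools import product


def bayes_risk_01(pi, p, q):
    """Brute-force infimum over all classifiers r in {0,1}^X of
    pi * E_P[r = 1] + (1 - pi) * E_Q[r = 0]."""
    return min(
        sum(pi * px * r + (1 - pi) * qx * (1 - r) for px, qx, r in zip(p, q, rs))
        for rs in product((0, 1), repeat=len(p))
    )


def generalised_variational_divergence(pi, p, q):
    """Brute-force supremum in (100) over sign vectors rho in {-1,1}^X."""
    return 2 * max(
        abs(sum((pi * px - (1 - pi) * qx) * s for px, qx, s in zip(p, q, signs)))
        for signs in product((-1, 1), repeat=len(p))
    )


p = [0.5, 0.3, 0.2]   # hypothetical distribution P on a three-point space
q = [0.2, 0.2, 0.6]   # hypothetical distribution Q on the same space
for pi in (0.25, 0.4, 0.5):
    risk = bayes_risk_01(pi, p, q)
    v = generalised_variational_divergence(pi, p, q)
    print(pi, risk, 0.5 - v / 4)
```

At $\pi = \frac12$ the divergence computed this way equals $\sum_x |p(x) - q(x)|$, which is the classical variational divergence on a finite space.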

The requirement that $\operatorname{sgn}\mathcal{R} \subseteq \mathcal{R}$ is unattractive. It is necessitated by the use of 0-1 
loss. It can be removed by instead considering the linear loss. 

It is convenient to temporarily switch conventions so that the labels $y \in \{-1,1\}$. 
Consider the linear loss 

\[ \ell^{\mathrm{lin}}(r(x), y) := 1 - y\,r(x), \quad y \in \{-1,1\}. \]

If $r$ is unrestricted, then there is no guarantee that $\ell^{\mathrm{lin}} > -\infty$, so that it is a legitimate 
loss function. Below we will always consider $r \in \mathcal{R}$ such that the linear loss is bounded 
from below. Observe that the common hinge loss (Steinwart and Christmann, 2008) is 
simply $\ell^{\mathrm{hinge}}(r(x), y) = 0 \vee \ell^{\mathrm{lin}}(r(x), y)$. 

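Theorem 30 Assume that $\mathcal{R} \subseteq [-a, a]^{\mathcal{X}}$ for some $a > 0$ and is symmetric about zero. 
Then for all $\pi \in (0,1)$ and all distributions $P$ and $Q$ on $\mathcal{X}$ 

\[ L^{\mathrm{lin}}_{\mathcal{R}}(\pi, P, Q) = 1 - \tfrac12 V_{\mathcal{R},\pi}(P,Q) \tag{103} \]

and the $r$ that attains $L^{\mathrm{lin}}_{\mathcal{R}}(\pi, P, Q)$ corresponds to the $r$ that obtains the supremum in 
the definition of $V_{\mathcal{R},\pi}(P,Q)$. 

Proof 

\[
\begin{aligned}
L^{\mathrm{lin}}_{\mathcal{R}}(\pi, P, Q) &= \inf_{r \in \mathcal{R}} \big(\pi\mathbb{E}_{X \sim P}\, \ell^{\mathrm{lin}}(r(X), -1) + (1-\pi)\mathbb{E}_{X \sim Q}\, \ell^{\mathrm{lin}}(r(X), +1)\big) && (104) \\
&= \inf_{r \in \mathcal{R}} \big(\pi\mathbb{E}_{X \sim P}(1 + r(X)) + (1-\pi)\mathbb{E}_{X \sim Q}(1 - r(X))\big) \\
&= \inf_{r \in \mathcal{R}} \big(\pi + \pi\mathbb{E}_P r + (1-\pi) - (1-\pi)\mathbb{E}_Q r\big) \\
&= 1 + \inf_{r \in \mathcal{R}} \big(\pi\mathbb{E}_P r - (1-\pi)\mathbb{E}_Q r\big) && (105) \\
&= 1 - \sup_{r \in \mathcal{R}} \big(\pi\mathbb{E}_P(-r) - (1-\pi)\mathbb{E}_Q(-r)\big) \\
&= 1 - \sup_{r \in \mathcal{R}} \big(\pi\mathbb{E}_P r - (1-\pi)\mathbb{E}_Q r\big) \\
&= 1 - \tfrac12 V_{\mathcal{R},\pi}(P,Q),
\end{aligned}
\]

where the penultimate step exploits the symmetry of $\mathcal{R}$. ■ 

Now suppose that $\mathcal{R} = B_{\mathcal{H}} := \{r : \|r\|_{\mathcal{H}} \le 1\}$, the unit ball in $\mathcal{H}$, a Reproducing 
Kernel Hilbert Space (RKHS) (Schölkopf and Smola, 2002). Thus for all $r \in \mathcal{R}$ there 
exists a feature map $\phi\colon \mathcal{X} \to \mathcal{H}$ such that $r(x) = \langle r, \phi(x)\rangle_{\mathcal{H}}$ and $\langle\phi(x), \phi(y)\rangle_{\mathcal{H}} = k(x, y)$, 
where $k$ is a positive definite kernel function. Borgwardt et al. (2006) show that 

\[ V_{B_{\mathcal{H}},\frac12}(P,Q) = \|\mathbb{E}_P\phi - \mathbb{E}_Q\phi\|_{\mathcal{H}}. \tag{106} \]

Thus 

\[ L^{\mathrm{lin}}_{B_{\mathcal{H}}}(\tfrac12, P, Q) = 1 - \tfrac12\|\mathbb{E}_P\phi - \mathbb{E}_Q\phi\|_{\mathcal{H}}. \tag{107} \]

Empirical estimators derived from the correspondence between (106) and (107) lead to 
the $\nu$-Support Vector Machine and Maximum Mean Discrepancy; see Appendix F. 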
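
To make the correspondence between (106) and (107) concrete, the distance between mean embeddings can be estimated from samples using only kernel evaluations. This is a minimal sketch, assuming a one-dimensional Gaussian kernel and the simple plug-in (biased) estimator of the squared distance; the kernel choice, bandwidth and function names are assumptions for illustration and are not the estimators developed in Appendix F.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Positive definite kernel k(x, y) on the real line (assumed for illustration)."""
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))


def mean_embedding_distance(xs, ys, kernel=gaussian_kernel):
    """Plug-in estimate of ||E_P phi - E_Q phi||_H from samples xs ~ P and ys ~ Q,
    using <phi(x), phi(y)>_H = k(x, y)."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return math.sqrt(max(kxx + kyy - 2 * kxy, 0.0))


def linear_loss_bayes_risk(xs, ys, kernel=gaussian_kernel):
    """Empirical analogue of (107): 1 - (1/2) * ||E_P phi - E_Q phi||_H."""
    return 1 - 0.5 * mean_embedding_distance(xs, ys, kernel)


sample_p = [0.1, -0.4, 0.3, 0.0]   # hypothetical sample from P
sample_q = [1.2, 0.9, 1.5, 1.1]    # hypothetical sample from Q
print(mean_embedding_distance(sample_p, sample_q))
print(linear_loss_bayes_risk(sample_p, sample_q))
```

When the two samples coincide the estimated distance is zero and the estimated risk is 1, which is the value of the linear loss achieved by the zero function.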
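Let $\operatorname{aco}\mathcal{R}$ denote the absolute convex hull of $\mathcal{R}$: 

\[ \operatorname{aco}\mathcal{R} := \left\{ \sum_i a_i r_i : r_i \in \mathcal{R},\ \sum_i |a_i| \le 1,\ a_i \in \mathbb{R} \right\}. \]

The following is a minor generalisation of a result due to Müller (1997a). 

Theorem 31 For all $P$, $Q$ and $\pi \in (0,1)$, $V_{\operatorname{aco}\mathcal{R},\pi}(P,Q) = V_{\mathcal{R},\pi}(P,Q)$. 

Proof Let $B_1 := \{(a_i)_i : \sum_i |a_i| \le 1\}$. Then 

\[
\begin{aligned}
V_{\operatorname{aco}\mathcal{R},\pi}(P,Q) &= 2 \sup_{r \in \operatorname{aco}\mathcal{R}} \big|\pi\mathbb{E}_P r - (1-\pi)\mathbb{E}_Q r\big| \\
&= 2 \sup_{(a_i)_i \in B_1}\ \sup_{\{r_i\}_i \subseteq \mathcal{R}} \Big|\pi\mathbb{E}_P \sum_i a_i r_i - (1-\pi)\mathbb{E}_Q \sum_i a_i r_i\Big| \\
&= 2 \sup_{(a_i)_i \in B_1}\ \sup_{\{r_i\}_i \subseteq \mathcal{R}} \Big|\sum_i a_i \big(\pi\mathbb{E}_P r_i - (1-\pi)\mathbb{E}_Q r_i\big)\Big| \\
&= 2 \sup_{(a_i)_i \in B_1} \sum_i |a_i| \sup_{r_i \in \mathcal{R}} \big|\pi\mathbb{E}_P r_i - (1-\pi)\mathbb{E}_Q r_i\big| \\
&= \sup_{(a_i)_i \in B_1} \sum_i |a_i|\, V_{\mathcal{R},\pi}(P,Q) \\
&= V_{\mathcal{R},\pi}(P,Q). \qquad ■
\end{aligned}
\]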

Combining this theorem with Theorem 30 shows that for all $P$, $Q$ and all $\pi \in (0,1)$, 

\[ L^{\mathrm{lin}}_{\operatorname{aco}\mathcal{R}}(\pi, P, Q) = L^{\mathrm{lin}}_{\mathcal{R}}(\pi, P, Q); \tag{108} \]

that is, taking the absolute convex hull does not change the Bayes risk when using linear 
loss. Let $S(P, Q, \mathcal{R}, \epsilon) := \Pr\{L^{\mathrm{lin}}_{\mathcal{R}}(\pi, P, Q) - L^{\mathrm{lin}}_{\mathcal{R}}(\pi, P_n, Q_n) > \epsilon\}$ denote the probability 
of being misled by more than $\epsilon$ on a sample of size $n$, where $P_n$ and $Q_n$ are the respective 
empirical distributions induced by the empirical distribution $\mathbb{P}_n$. Since $P$, $Q$ are arbitrary 
in (108) we conclude that $S(P, Q, \operatorname{aco}(\mathcal{R}), \epsilon) = S(P, Q, \mathcal{R}, \epsilon)$, which is hinted at by the 
Rademacher average upper bounds on sample complexity and invariance to forming 
absolute convex hulls (Bartlett and Mendelson, 2002), but as far as we are aware has 
never been stated as above. Note that the use of linear loss is essential here and it is only 
well defined for suitable $\mathcal{R}$. Appendix F shows that the standard SVM can be derived 
using linear loss. 

8.2 Variational Representation of $I_f$ and its Generalizations 

The variational representation of the variational divergence (99) suggests the question 
of whether there is a variational representation for a general $f$-divergence. This has been 
considered previously. We briefly summarise the approach, and then explore some (new) 
implications of the representation. 

One can obtain a variational representation for $I_f$ by substituting a variational representation 
for $f$ into the definition of $I_f$ (Keziou, 2003a,b; Broniatowski, 2004; Broniatowski 
and Keziou, 2009). Let $p$ and $q$ denote the densities corresponding to $P$ and $Q$ and 
assume for now they exist. Recall from Section 2.2 above that the Legendre-Fenchel conjugate 
of $f$ is given by $f^*(s) = \sup_{u \in \operatorname{Dom} f} us - f(u)$. In general $\operatorname{Ran} f^* = \mathbb{R}^* := \mathbb{R} \cup \{+\infty\}$. 
Since $f(u) = \sup_{\rho \in \mathbb{R}} u\rho - f^*(\rho)$, we can write 

\[
\begin{aligned}
I_f(P,Q) &= \int_{\mathcal{X}} q(x) \sup_{\rho \in \mathbb{R}} \left(\rho\,\frac{p(x)}{q(x)} - f^*(\rho)\right) dx \\
&= \sup_{\rho \in \mathbb{R}^{\mathcal{X}}} \int_{\mathcal{X}} \rho(x)p(x) - f^*(\rho(x))q(x)\,dx \\
&= \sup_{\rho \in \mathbb{R}^{\mathcal{X}}} \big(\mathbb{E}_P\rho - \mathbb{E}_Q f^*(\rho)\big).
\end{aligned}
\]

We make this concrete by considering the variational divergence. The corresponding $f$ 
is given by $f(t) = |t - 1|$ and (adopting the convention that $0 \cdot \infty = 0$) $f^*(x) = [x \notin 
[-1,1]]\,\infty + [x \in [-1,1]]\,x$. Since the supremum in (109) will not be attained if the second 
term is infinite, one can restrict the supremum to be over $\mathcal{F} = \{\rho \in \mathbb{R}^{\mathcal{X}} : \|\rho\|_\infty \le 1\}$. 
Thus 

\[
\begin{aligned}
V(P,Q) &= \sup_{\rho : \|\rho\|_\infty \le 1} (\mathbb{E}_P\rho - \mathbb{E}_Q\rho) = \sup_{\rho \in \{-1,1\}^{\mathcal{X}}} (\mathbb{E}_P\rho - \mathbb{E}_Q\rho) \\
&= \sup_{\rho \in \{0,2\}^{\mathcal{X}}} (\mathbb{E}_P\rho - \mathbb{E}_Q\rho) = 2\sup_{\rho \in \{0,1\}^{\mathcal{X}}} (\mathbb{E}_P\rho - \mathbb{E}_Q\rho) = 2\sup_{A} |P(A) - Q(A)|
\end{aligned}
\]

since the supremum will be attained for functions $\rho$ taking on values only in $\{-1,1\}$ and 
the remaining steps are simply a shift and rescaling (to $\{0,2\}$ by adding 1, and then to 
$\{0,1\}$). 
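
The same substitution can be carried out for other $f$-divergences. The sketch below works through the KL divergence on a finite sample space, assuming $f(t) = t\ln t$, whose Legendre-Fenchel conjugate is $f^*(s) = e^{s-1}$; the objective $\mathbb{E}_P\rho - \mathbb{E}_Q f^*(\rho)$ is then maximised at $\rho^*(x) = 1 + \ln\frac{p(x)}{q(x)}$. The distributions and function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(P, Q) = sum_x p(x) ln(p(x) / q(x)) for strictly positive p and q."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))


def variational_objective(rho, p, q):
    """E_P[rho] - E_Q[f*(rho)] with f*(s) = exp(s - 1), the conjugate of t ln t."""
    return sum(px * r for px, r in zip(p, rho)) - sum(qx * math.exp(r - 1) for qx, r in zip(q, rho))


p = [0.5, 0.5]     # hypothetical P
q = [0.25, 0.75]   # hypothetical Q

# Maximiser of the objective: rho*(x) = 1 + ln(p(x) / q(x)).
rho_star = [1 + math.log(px / qx) for px, qx in zip(p, q)]
print(kl_divergence(p, q), variational_objective(rho_star, p, q))

# Any other rho gives a smaller value, so the objective lower-bounds KL(P, Q).
perturbed = [r + 0.3 for r in rho_star]
print(variational_objective(perturbed, p, q) < kl_divergence(p, q))
```

Restricting $\rho$ to a smaller class than $\mathbb{R}^{\mathcal{X}}$ can only decrease the supremum, which is what makes restricted versions of the representation usable as lower bounds.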

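The representation (109) suggests the generalisation 

\[
\begin{aligned}
I_{f,\mathcal{F}}(P,Q) &:= \sup_{\rho \in \mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}} \int_{\mathcal{X}} \rho(x)p(x) - f^*(\rho(x))q(x)\,dx \\
&= \sup_{\rho \in \mathcal{F}} \big(\mathbb{E}_P\rho - \mathbb{E}_Q f^*(\rho)\big).
\end{aligned}
\]

Observing this is not symmetric in $p$ and $q$ suggests a further generalisation: 

\[ I_{f,g,\mathcal{F}}(P,Q) := \sup_{\rho \in \mathcal{F}} \int_{\mathcal{X}} -g^*(\rho(x))p(x) - f^*(\rho(x))q(x)\,dx \tag{109} \]
\[ \phantom{I_{f,g,\mathcal{F}}(P,Q) :}= \sup_{\rho \in \mathcal{F}} \big(-\mathbb{E}_P g^*(\rho) - \mathbb{E}_Q f^*(\rho)\big). \tag{110} \]

Here $g^*$ is the $\mathbb{R}^*$-valued LF conjugate of a convex function $g$. Set $I_{f,g} := I_{f,g,\mathbb{R}^{\mathcal{X}}}$. 
An alternative generalisation of $I_f$ is 

\[ \tilde{I}_{f,g,\mathcal{F}}(P,Q) := \sup_{\rho \in \mathcal{F}} \big(\mathbb{E}_P g^*(\rho) - \mathbb{E}_Q f^*(\rho)\big) \tag{111} \]

which is identical to (109) except for removal of the minus sign preceding $g^*$. Set 
$\tilde{I}_{f,g} := \tilde{I}_{f,g,\mathbb{R}^{\mathcal{X}}}$. If $\rho \in \mathcal{F}$ are such that $\|\rho\|_\infty$ is unbounded, then in general $\tilde{I}_{f,g,\mathcal{F}}(P,Q)$ 
will be infinite. Properties of the alternative definition relate to the extended infimal 
convolution between two convex functions. 

Definition 32 Suppose $f, g\colon \mathbb{R}^+ \to \mathbb{R}^*$ are convex. The extended infimal convolution is 

\[ (f \,\square\, g)(r) := \inf_{x} f(x) + r\,g(x/r), \quad r \in \mathbb{R}^+. \]

Note that the second term in this convolution is the perspective function (Section 2.1) 
applied to $g$, that is, $I_g(x, r)$. 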

Theorem 33 Suppose f, g : R + — > M* are convex. Then 

1. I f (P,Q) = I m x(P,Q),I f;iA ,?(P,Q) = I f: ?(P,Q), andI t „\ t _ lU3 {P,Q) = 2V _i(P,Q). 

' 2 

^ I/i.si,? = ^2, 32,^ */ h~ h = fa and gi - g 2 = g a and /i, f 2 , f a , 9i,92,g a are 
affine. 

3- lfj,s = Iid,id,/*(3o(P Q)- 

I If j,? = f id ,id,/*(?)(P Q) = 2V> (?) (P, Q). 

Proof Part 1 follows immediately from the various definitions. Since affine functions 
are the only functions that are simultaneously convex and concave, = I/ 2 ,g2,3 r 

only if /i, /2 (resp. gi,g 2 ) are affine and their differences are affine (since an affine offset 
will not change I). This proves part 2. 
We have by change of variables 

! /i/)? (P, Q) = sup(E P /*(p) - E /*(P)) = sup (Ep^ - E Q V) = iid,id,/*(5)( P ' Q)> 

(112) 
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where /*(3~) := {/* o p: p £ 1}. (Exactly the same argument applies to Ifj^ although 
su Vip£g*(3*)(— ^pip — EqV) does not correspond to a generalised variational divergence.) 
This proves parts 3 and 4. 



The proof of part 5 is in Appendix A.4. It suggests the question: given a suitable convex 
$f$, does there always exist $g$ such that $f = g \,\square\, g$? This is analogous to the question of 
spectral factorisation (Sayed and Kailath, 2001) for ordinary linear convolution. We do 
not know the answer to this question, but have collected a few examples in Appendix E 
that demonstrate it is certainly true for some $f$. There does not appear to be a result 
analogous to part 5 of Theorem 33 for $I_{f,g}$. 

We have seen how $f$-divergences are related to integral probability metrics $V_{\mathcal{F}}$. It 
turns out that the variational divergence is special in being both. Many integral probability 
metrics are true metrics (Müller, 1997a,b). The only $f$-divergence that is a metric 
is the variational divergence. Whether there exist $\mathcal{F}$ such that $V_{\mathcal{F}}(\cdot,\cdot)$ is not a metric 
but equals $I_f(\cdot,\cdot)$ for some $f \ne t \mapsto |t - 1|$ (or affine transformation thereof) is left as an 
open problem. 

We end with another open problem. We have seen how $L_{\mathcal{F}}$ and $V_{\mathcal{F}}$ are related. This 
raises the question whether there is a representation of the form 



9. Conclusions 

There are several existing concepts that can be used to quantify the amount of in- 
formation in a task and its difficulty: Uncertainty, Bregman information, statistical 
information, Bayes risk and regret, and /-divergences. Information is a difference in un- 
certainty; regret is a difference in risk. In the case of supervised binary class probability 
estimation, we have connected and extended several existing results in the literature to 
show how to translate between these perspectives. The representations allow a precise 
answer to the question of what are the primitives for binary experiments. 

We have derived the integral representations in a simple and unified manner, and 
illustrated the value of the representations. Along the way we have drawn connections 
to a diverse set of concepts related to binary experiments: risk curves, cost curves, ROC 
curves and the area under them; variational representations of /-divergences, risks and 
regrets. 

Two key consequences are surrogate regret bounds that are at once more general 
and simpler than those in the literature, and a generalisation of the classical Pinsker 
inequality providing, inter alia, an explicit form for the best possible Pinsker inequality 
relating Kullback-Leibler divergence and variational divergence. The parametrisation 
of regret in terms of weighted integral representations also shows the connection with 
matching losses and provides a simple proof of the convexity of the composite loss induced 
by a proper scoring rule with its canonical link function. We have also presented a 
new derivation of support vector machines and their relationship to Maximum Mean 
Discrepancy (integral probability metrics). 

Table 2: Summary relationships between key objects arising in Binary Experiments 
(columns: Given, Assumed, Derived). 
"Given" indicates the object is given or provided by the world; "Assumed" 
is something the user assumes or imposes in order to create a well defined 
problem; "Derived" indicates quantities that are derived from the primitives. 

The key relationships between the basic objects of study are summarised in Table 2 
and Figure 7. 

All of the results we present demonstrate the fundamental and elementary nature of 
the cost-weighted misclassification loss, which is becoming increasingly appreciated in 
the Machine Learning literature (Bach et al., 2006; Beygelzimer et al., 2008). 

More generally, the present work is small part of a larger research agenda to under- 
stand the whole field of machine learning in terms of relations between problems. We 
envisage these relations being richer and more powerful than the already valuable reduc- 
tions between learning problems. Much of the present literature on machine learning is 
highly solution focussed. Of course one does indeed like to solve problems, and we do 
not suggest otherwise. But it is hard to see structure in the panoply of solutions which 
continue to grow each year. The present paper is a first step to a pluralistic unification of 
a diverse set of machine learning problems. The goal we have in mind can be explained 
by analogy: 

Within the field of computational complexity (especially NP-completeness), the work of Garey
and Johnson (1979) and Johnson (1982-1992; 2005-2007) led to a detailed and structured
understanding of the relationships between many fundamental problems and consequently
guides the search for solutions to new problems. Compare Machine Learning
problems with mathematical functions. In the 19th century, each function was considered
separately. Functional Analysis (Dieudonné, 1981) catalogued them by considering
sets of functions and relations (mappings) between them and subsequently developed 
many new and powerful tools. The increasing abstraction and focus on relations has 
remained a powerful force in mathematics (Wikipedia, 2007). A systematic cataloging 
(taxonomy) resonates with Biology's Linnean past — and taxonomies can indeed lead 
to standardisation and efficiency (Bowker and Star, 1999). But taxonomies alone are 
inadequate — it seems necessary to understand the relationships in a manner analogous 
to Systems Biology which "is about putting together rather than taking apart, integra- 
tion rather than reduction. . . . Successful integration at the systems level must be built 
on successful reduction, but reduction alone is far from sufficient." (Noble, 2006).
Finally, Lyell's Principles of Geology (Lyell, 1830) was a watershed in Geology's history
(Bowker, 2005); prior work is pre-historical. Lyell's key insight was to explain the huge
diversity of geological formations in terms of a relatively simple set of transformations
applied repeatedly.

These analogies encourage our aspiration that by more systematically understanding 
the relationships between machine learning problems and how they can be transformed 
into each other, we will develop a better organised and more powerful toolkit for solving 
existing and future problems, and will make progress along the lines suggested by Hand 
(1994). 
Appendix A. Proofs

A.1 Proof of Corollary 3

Integration by parts of $t\phi''(t)$ gives $\int_0^1 t\phi''(t)\,dt = \phi'(1) - (\phi(1) - \phi(0))$, which can be
rearranged to give

\[ \phi'(1) = \int_0^1 t\phi''(t)\,dt + (\phi(1) - \phi(0)). \]

Substituting this into the Taylor expansion of $\phi(s)$ about 1 yields

\begin{align*}
\phi(s) &= \phi(1) + \left[\int_0^1 t\phi''(t)\,dt + (\phi(1) - \phi(0))\right](s-1) + \int_0^1 (t-s)_+\,\phi''(t)\,dt \\
&= \phi(1) + (\phi(1) - \phi(0))(s-1) + \int_0^1 t(s-1)\,\phi''(t)\,dt + \int_0^1 (t-s)_+\,\phi''(t)\,dt \\
&= \phi(1) + (\phi(1) - \phi(0))(s-1) - \int_0^1 \psi(s,t)\,\phi''(t)\,dt,
\end{align*}

where $\psi(s,t) := \min\{(1-t)s, (1-s)t\}$.

This form of $\psi$ is valid since

\[ -t(s-1) - (t-s)_+ = \begin{cases} s - ts, & t > s \\ t - ts, & t \le s \end{cases} = \min\{(1-t)s, (1-s)t\} \]

as required.
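The representation proved in A.1 can be checked numerically. The following is a minimal sketch, not part of the original proof: the test function `phi` and the midpoint-rule quadrature are assumptions chosen purely for illustration.

```python
import math


def psi(s, t):
    """Kernel psi(s, t) = min{(1 - t) s, (1 - s) t} from the proof of Corollary 3."""
    return min((1.0 - t) * s, (1.0 - s) * t)


def integral_representation(phi, phi_second, s, n=20000):
    """Evaluate phi(1) + (phi(1) - phi(0)) (s - 1) - int_0^1 psi(s, t) phi''(t) dt.

    The integral is approximated with the midpoint rule on n cells.
    """
    h = 1.0 / n
    integral = h * sum(
        psi(s, (k + 0.5) * h) * phi_second((k + 0.5) * h) for k in range(n)
    )
    return phi(1.0) + (phi(1.0) - phi(0.0)) * (s - 1.0) - integral


# Hypothetical twice-differentiable test function (an assumption, not from the paper).
def phi(s):
    return s ** 3 + math.exp(s)


def phi_second(s):
    return 6.0 * s + math.exp(s)


# Largest discrepancy between phi(s) and its integral representation on a few points.
max_error = max(
    abs(integral_representation(phi, phi_second, s) - phi(s))
    for s in (0.0, 0.2, 0.5, 0.9, 1.0)
)
```

The discrepancy `max_error` should be at the level of the quadrature error only, which is what one expects if the representation holds.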



A.2 Proof of Theorem 6 

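Expanding the definition of the Jensen gap using the definition of $\psi$ gives

\begin{align*}
J_\mu[\psi(S)] &= \mathbb{E}_\mu[\psi(S)] - \psi(\mathbb{E}_\mu[S]) \\
&= \mathbb{E}_\mu[\phi(S) + bS + a] - (\phi(\mathbb{E}_\mu[S]) + b\,\mathbb{E}_\mu[S] + a) \\
&= \mathbb{E}_\mu[\phi(S)] + b\,\mathbb{E}_\mu[S] + a - \phi(\mathbb{E}_\mu[S]) - b\,\mathbb{E}_\mu[S] - a \\
&= J_\mu[\phi(S)]
\end{align*}

as required.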

A.3 Proof of Theorem 9 

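Proof Given a task $(\pi, P, Q; \ell)$ we need to first check that

\[ f^\pi(t) := L(\pi) - (\pi t + 1 - \pi)\, L\!\left(\frac{\pi t}{\pi t + 1 - \pi}\right) \tag{113} \]

is convex and that $f^\pi(1) = 0$. This latter fact is obtained immediately by substituting
$t = 1$ into $f^\pi(t)$ yielding $L(\pi) - L(\pi) = 0$. The convexity of $f^\pi$ is guaranteed by
Theorem 7, which shows that $L$ is concave (so $-L$ is convex), and the fact that the perspective transform
of a convex function is always convex (see Section 2.1). Thus the function

\[ t \mapsto I_{-L}(\pi t, \pi t + 1 - \pi) = -(\pi t + 1 - \pi)\, L\!\left(\frac{\pi t}{\pi t + 1 - \pi}\right) \]

is the composition of a convex function and an affine one and therefore convex.
Substituting (113) into the definition of f-divergence in (17) yields

\begin{align*}
\mathbb{E}_Q[f^\pi(dP/dQ)] &= \mathbb{E}_Q\!\left[ L(\pi) - \left(\pi\frac{dP}{dQ} + 1 - \pi\right) L\!\left(\frac{\pi\,dP}{\pi\,dP + (1-\pi)\,dQ}\right)\right] \\
&= \mathbb{E}_M\!\left[ L(\pi) - L\!\left(\pi\frac{dP}{dM}\right)\right]
\end{align*}

since $dM = \pi\,dP + (1-\pi)\,dQ$. Recall that $\eta = \pi\,dP/dM$. As $L(\pi)$ is constant we note
that $L(\pi) = \mathbb{E}_M[L(\pi)] = L(\pi, M)$ and so

\begin{align*}
\mathbb{E}_Q[f^\pi(dP/dQ)] &= L(\pi) - \mathbb{E}_M[L(\eta)] \\
&= L(\pi, M) - L(\eta, M) \\
&= \Delta L(\eta, M)
\end{align*}

as required for the forward direction.
Starting with

\[ L^\pi(\eta) := -\frac{1-\eta}{1-\pi}\, f\!\left(\frac{1-\pi}{\pi}\,\frac{\eta}{1-\eta}\right) \]

and substituting into the definition of statistical information in (27) gives us

\begin{align*}
\Delta L^\pi(\eta, M) &= \mathbb{E}_M[L^\pi(\pi)] - \mathbb{E}_M[L^\pi(\eta)] \\
&= 0 + \int_{\mathcal{X}} \frac{1-\eta}{1-\pi}\, f\!\left(\frac{1-\pi}{\pi}\,\frac{\eta}{1-\eta}\right) dM \\
&= \int_{\mathcal{X}} f\!\left(\frac{dP}{dQ}\right) dQ \\
&= I_f(P, Q)
\end{align*}

since $f(1) = 0$, $dQ = (1-\eta)/(1-\pi)\,dM$ and

\[ \frac{dP}{dQ} = \frac{1-\pi}{\pi}\,\frac{\eta}{1-\eta} \]

by the discussion in Section 4.1. This proves the converse statement of the theorem.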



A.4 Proof of part 5 of Theorem 33 

We need the following lemma. 

Lemma 34 Let $f\colon \mathbb{R} \to \mathbb{R}$ and $K\colon \mathbb{R}\times\mathbb{R} \to \mathbb{R}$ be convex and bounded from below. Then
the extended infimal convolution

\[ (f \,\Box\, K)(x) = \inf_{y\in\mathbb{R}} f(y) + K(x, y), \quad x \in \mathbb{R} \]

is convex in $x \in \mathbb{R}$.

Observe that if $K(x, y) = g(x - y)$ for convex $g$, then $f \,\Box\, K = f \oplus g$, the standard infimal
convolution (Hiriart-Urruty and Lemaréchal, 1993b). This extended infimal convolution
seems little studied apart from by Cepedello-Boiso (1998).

Proof Let $\tilde f(x, y) := f(y)$, $x \in \mathbb{R}$. Clearly $\tilde f$ is convex on $\mathbb{R}\times\mathbb{R}$. Let $h(x, y) =
\tilde f(x, y) + K(x, y)$. Hiriart-Urruty and Lemaréchal (1993b, Proposition 2.1.1) show that
$h$ is convex on $\mathbb{R}\times\mathbb{R}$. Observe that $(f \,\Box\, K)(x) = \inf\{h(x, y) : y \in \mathbb{R}\}$, i.e. the marginal
function of $h$. Since by construction $h$ is bounded from below, using the result of Hiriart-Urruty
and Lemaréchal (1993b, p. 169) proves the result. ■



Corollary 35 For any convex $f$ and $g$, $f \,\Box\, g$ is convex.

Proof Observe that $(f \,\Box\, g)(x) = \inf_{y\in\mathbb{R}_+} f(y) + x\,g(y/x) = \inf_{y\in\mathbb{R}_+} f(y) + I_g(x, y)$,
$x \in \mathbb{R}_+$, where $I_g$ is the perspective function (1). Hiriart-Urruty and Lemaréchal (1993b,
Proposition 2.2.1) show that if $g\colon \mathbb{R}^n \to \mathbb{R}$ is convex then the perspective $I_g$ is convex
on $\mathbb{R}^{n+1}$. The corollary then follows from the lemma. ■



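Proof (part 5 of Theorem 33) Observe that if $h(x) = t\phi(x)$ then the LF conjugate
$h^*(s) = t\phi^*(s/t)$. Thus using the Fenchel duality theorem (Rockafellar, 1970) we have,
using (Rockafellar and Wets, 2004, Theorem 14.60) to justify swapping the order of
the supremum and integration,

\begin{align*}
I_{f,g}(P, Q) &= \sup_{\rho\in\mathbb{R}^{\mathcal{X}}} \int_{\mathcal{X}} -g^*(\rho(x))\,p(x) - f^*(\rho(x))\,q(x)\,dx \\
&= \int_{\mathcal{X}} \sup_{\rho\in\mathbb{R}} -g^*(\rho)\,p(x) - f^*(\rho)\,q(x)\,dx \\
&= \int_{\mathcal{X}} \inf_{\rho\in\mathbb{R}} q(x)\, f\!\left(\frac{\rho}{q(x)}\right) + p(x)\, g\!\left(\frac{\rho}{p(x)}\right) dx \\
&= \int_{\mathcal{X}} i_{f,g}(p, q)(x)\,dx,
\end{align*}

where

\[ i_{f,g}(p, q)(\cdot) := \inf_{\rho\in\mathbb{R}} q(\cdot)\, f\!\left(\frac{\rho}{q(\cdot)}\right) + p(\cdot)\, g\!\left(\frac{\rho}{p(\cdot)}\right). \]

Let $x := \rho/q \in \mathbb{R}_+$. Thus $\rho = xq$ and

\[ i_{f,g}(p, q) = \inf_{x\in\mathbb{R}_+} q\,f(x) + p\,g(xq/p). \]

Let $\tau = p/q \in \mathbb{R}_+$. Thus

\begin{align*}
i_{f,g}(p, q)(\tau) &= \inf_{x\in\mathbb{R}_+} q\,f(x) + p\,g(x/\tau) \\
&= q \cdot \inf_{x\in\mathbb{R}_+} f(x) + \tau\,g(x/\tau) \\
&= q\cdot(f \,\Box\, g)(\tau). \tag{114}
\end{align*}

Let $h := f \,\Box\, g$. Observe from (114) that $i_{f,g}(p, q) = q\,h(p/q)$ and thus

\[ I_{f,g}(P, Q) = \int_{\mathcal{X}} q(x)\, h\!\left(\frac{p(x)}{q(x)}\right) dx = I_h(P, Q) \]

if $h$ is convex, which we know to be the case from Corollary 35.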



A.5 Pinsker Theorems

Proof (Theorem 27) Given a binary experiment $(P, Q)$ denote the corresponding
statistical information as

\[ \phi(\pi) = \phi_{(P,Q)}(\pi) := \Delta L^{0-1}(\pi, P, Q) = \pi \wedge (1-\pi) - \psi_{(P,Q)}(\pi), \tag{115} \]

where $\psi_{(P,Q)}(\pi) = \psi(\pi) = L^{0-1}(\pi, P, Q)$. We know that $\psi$ is non-negative and concave
and satisfies $\psi(\pi) \le \pi \wedge (1-\pi)$ and thus $\psi(0) = \psi(1) = 0$.
Since

\[ I_f(P, Q) = \int_0^1 \phi(\pi)\,\gamma_f(\pi)\,d\pi, \tag{116} \]

$I_f(P, Q)$ is minimized by minimizing $\phi_{(P,Q)}$ over all $(P, Q)$ such that

\[ \phi(\pi_i) = \phi_i = \pi_i \wedge (1 - \pi_i) - \psi_{(P,Q)}(\pi_i). \]

Let $\psi_i := \psi(\pi_i) = \tfrac12 - \tfrac14 V_{\pi_i}(P, Q)$. The problem becomes:

\begin{align*}
&\text{Given } (\pi_i, \psi_i)_{i=1}^n \text{ find the maximal } \psi\colon [0,1] \to [0, \tfrac12] \text{ such that} \tag{117} \\
&\psi(\pi_i) = \psi_i, \quad i = 0, \dots, n+1, \tag{118} \\
&\psi(\pi) \le \pi \wedge (1-\pi), \quad \pi \in [0,1], \tag{119} \\
&\psi \text{ is concave.} \tag{120}
\end{align*}

This will tell us the optimal $\phi$ to use since optimising over $\psi$ is equivalent to optimizing
over $L(\cdot, P, Q)$. Under the additional assumption on $\mathcal{X}$, Corollary 25 implies that for any
$\psi$ satisfying (118), (119) and (120) there exists $P, Q$ such that $L(\cdot, P, Q) = \psi(\cdot)$.

Let $\Psi$ be the set of piecewise linear concave functions on $[0, 1]$ having $n + 1$ segments
such that $\psi \in \Psi \Rightarrow \psi$ satisfies (118) and (119). We now show that in order to solve
(117) it suffices to consider $\psi \in \Psi$.

If $g$ is a concave function on $\mathbb{R}$, then

\[ \partial g(x) := \{ s \in \mathbb{R} : g(y) \le g(x) + \langle s, y - x \rangle,\ y \in \mathbb{R} \} \]

denotes the sup-differential of $g$ at $x$. (This is the obvious analogue of the sub-differential
for convex functions (Rockafellar, 1970).) Suppose $\psi$ is a general concave function satisfying
(118) and (119). For $i = 1, \dots, n$, let

\[ G_i^{\psi} := \left\{ g_i^{\psi}\colon [0,1] \ni \pi_i \mapsto \psi_i \in \mathbb{R} \text{ is linear and } \tfrac{\partial}{\partial\pi} g_i^{\psi}(\pi)\big|_{\pi=\pi_i} \in \partial\psi(\pi_i) \right\}. \]

Observe that by concavity, for all concave $\psi$ satisfying (118) and (119), for all $g \in
\bigcup_{i=1}^n G_i^{\psi}$, $g(\pi) \ge \psi(\pi)$, $\pi \in [0,1]$.

Thus given any such $\psi$, one can always construct

\[ \psi^*(\pi) = \min\bigl(g_1^{\psi}(\pi), \dots, g_n^{\psi}(\pi)\bigr) \tag{121} \]

such that $\psi^*$ is concave, satisfies (118) and $\psi^*(\pi) \ge \psi(\pi)$, for all $\pi \in [0,1]$. It remains
to take account of (119). That is trivially done by setting

\[ \psi(\pi) = \min\bigl(\psi^*(\pi), \pi \wedge (1-\pi)\bigr) \tag{122} \]

which remains concave and piecewise linear (although with potentially one additional
linear segment). Finally, the pointwise smallest concave $\psi$ satisfying (118) and (119) is
the piecewise linear function connecting the points $(0, 0), (\pi_1, \psi_1), \dots, (\pi_n, \psi_n), (1, 0)$.

Figure 8: Illustration of construction of optimal $\psi(\pi) = L(\pi, P, Q)$. The optimal $\psi$ is
piecewise linear such that $\psi(\pi_i) = \psi_i$, $i = 0, \dots, n + 1$.



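Let $g\colon [0, 1] \to [0, \tfrac12]$ be this function which can be written explicitly as

\[ g(\pi) = \psi_i + \frac{(\psi_{i+1} - \psi_i)(\pi - \pi_i)}{\pi_{i+1} - \pi_i}, \quad \pi \in [\pi_i, \pi_{i+1}], \quad i = 0, \dots, n, \]

where we have defined $\pi_0 := 0$, $\psi_0 := 0$, $\pi_{n+1} := 1$ and $\psi_{n+1} := 0$.

We now explicitly parametrize this family of functions. Let $p_i\colon [0, 1] \to \mathbb{R}$ denote
the affine segment the graph of which passes through $(\pi_i, \psi_i)$, $i = 0, \dots, n + 1$. Write
$p_i(\pi) = a_i\pi + b_i$. We know that $p_i(\pi_i) = \psi_i$ and thus

\[ b_i = \psi_i - a_i\pi_i, \quad i = 0, \dots, n + 1. \tag{123} \]

In order to determine the constraints on $a_i$, since $g$ is concave and minorizes $\psi$, it
suffices to only consider $(\pi_{i-1}, g(\pi_{i-1}))$ and $(\pi_{i+1}, g(\pi_{i+1}))$ for $i = 1, \dots, n$. We have (for
$i = 1, \dots, n$)

\begin{align*}
& p_i(\pi_{i-1}) \ge g(\pi_{i-1}) \\
\Rightarrow\ & a_i\pi_{i-1} + b_i \ge \psi_{i-1} \\
\Rightarrow\ & a_i \underbrace{(\pi_{i-1} - \pi_i)}_{<0} \ge \psi_{i-1} - \psi_i \\
\Rightarrow\ & a_i \le \frac{\psi_{i-1} - \psi_i}{\pi_{i-1} - \pi_i}. \tag{124}
\end{align*}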
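Similarly we have (for $i = 1, \dots, n$)

\begin{align*}
& p_i(\pi_{i+1}) \ge g(\pi_{i+1}) \\
\Rightarrow\ & a_i\pi_{i+1} + b_i \ge \psi_{i+1} \\
\Rightarrow\ & a_i \underbrace{(\pi_{i+1} - \pi_i)}_{>0} \ge \psi_{i+1} - \psi_i \\
\Rightarrow\ & a_i \ge \frac{\psi_{i+1} - \psi_i}{\pi_{i+1} - \pi_i}. \tag{125}
\end{align*}

We now determine the points at which $\psi$ defined by (121) and (122) changes slope. That
occurs at the points $\pi$ when

\begin{align*}
& p_i(\pi) = p_{i+1}(\pi) \\
\Rightarrow\ & a_i\pi + \psi_i - a_i\pi_i = a_{i+1}\pi + \psi_{i+1} - a_{i+1}\pi_{i+1} \\
\Rightarrow\ & (a_{i+1} - a_i)\pi = \psi_i - \psi_{i+1} + a_{i+1}\pi_{i+1} - a_i\pi_i \\
\Rightarrow\ & \pi = \frac{\psi_i - \psi_{i+1} + a_{i+1}\pi_{i+1} - a_i\pi_i}{a_{i+1} - a_i} =: \tilde\pi_i
\end{align*}

for $i = 0, \dots, n$. Thus

\[ \psi(\pi) = p_i(\pi), \quad \pi \in [\tilde\pi_{i-1}, \tilde\pi_i], \quad i = 1, \dots, n. \]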

Let $a = (a_1, \dots, a_n)$. We explicitly denote the dependence of $\psi$ on $a$ by writing $\psi_a$. Let

\begin{align*}
\phi_a(\pi) &:= \pi \wedge (1 - \pi) - \psi_a(\pi) \\
&= \alpha_{a,i}\,\pi + \beta_{a,i}, \quad \pi \in [\bar\pi_{i-1}, \bar\pi_i], \quad i = 1, \dots, n + 1,
\end{align*}

where $a \in A_n$ (see (85)), $\bar\pi_i$, $\alpha_{a,i}$ and $\beta_{a,i}$ are defined by (88), (89) and (90) respectively.
The extra segment induced at index $j$ (see (87)) is needed since $\pi \mapsto \pi \wedge (1 - \pi)$ has
a slope change at $\pi = \tfrac12$. Thus in general, $\phi_a$ is piecewise linear with $n + 2$ segments
(recall $i$ ranges from 0 to $n + 2$); if $\tilde\pi_{k+1} = \tfrac12$ for some $k \in \{1, \dots, n\}$, then there will be
only $n + 1$ non-trivial segments.
Thus

\[ \left\{ \pi \mapsto \sum_{i=0}^{n} \phi_a(\pi) \cdot [\![ \pi \in [\bar\pi_i, \bar\pi_{i+1}] ]\!] \ :\ a \in A_n \right\} \]

is the set of $\phi$ consistent with the constraints and $A_n$ is defined in (85). Thus substituting
into (116), interchanging the order of summation and integration and optimizing
we have shown (91). The tightness has already been argued: under the additional assumption
on $\mathcal{X}$, there is no slop in the argument above since every $\psi$ satisfying the
constraints is the Bayes risk function for some $(P, Q)$. ■
Figure 9: The optimisation problem when $n = 1$. Given $\phi_1$, there are many risk curves
consistent with it. The optimisation problem involves finding the risk curve
that maximises $I_f$. $L$ and $U$ are defined in the text.



Proof (Theorem 28) In this case $n = 1$ and the optimal $\psi$ function will be piecewise
linear, concave, and its graph will pass through $(\pi_1, \psi_1)$. Thus the optimal $\phi$ will be of
the form

\[ \phi(\pi) = \begin{cases} 0, & \pi \in [0, L] \cup [U, 1] \\ \pi - (a\pi + b), & \pi \in [L, \tfrac12] \\ (1 - \pi) - (a\pi + b), & \pi \in [\tfrac12, U], \end{cases} \]

where $a\pi_1 + b = \psi_1 \Rightarrow b = \psi_1 - a\pi_1$ and $a \in [-2\psi_1, 2\psi_1]$ (see Figure 9). For variational
divergence, $\pi_1 = \tfrac12$ and thus

\[ \psi_1 = \pi_1 \wedge (1 - \pi_1) - \frac{V}{4} = \frac12 - \frac{V}{4} \tag{126} \]

and so $\phi_1 = V/4$. We can thus determine $L$ and $U$:

\begin{align*}
& aL + b = L \\
\Rightarrow\ & aL + \psi_1 - a\pi_1 = L \\
\Rightarrow\ & L = \frac{a\pi_1 - \psi_1}{a - 1}.
\end{align*}

Similarly $aU + b = 1 - U \Rightarrow U = \frac{1 - \psi_1 + a\pi_1}{a + 1}$ and thus

\[ I_f(P, Q) \ge \min_{a \in [-2\psi_1, 2\psi_1]} \int_{\frac{a\pi_1 - \psi_1}{a - 1}}^{\frac12} [(1 - a)\pi - \psi_1 + a\pi_1]\,\gamma_f(\pi)\,d\pi + \int_{\frac12}^{\frac{1 - \psi_1 + a\pi_1}{a + 1}} [(-a - 1)\pi - \psi_1 + a\pi_1 + 1]\,\gamma_f(\pi)\,d\pi. \tag{127} \]

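If $\gamma_f$ is symmetric about $\pi = \tfrac12$ (so by Corollary 13 $I_f$ is symmetric) and convex and
$\pi_1 = \tfrac12$, then the optimal $a = 0$. Thus in that case,

\begin{align*}
I_f(P, Q) &\ge 2\int_{\psi_1}^{\frac12} (\pi - \psi_1)\,\gamma_f(\pi)\,d\pi \tag{128} \\
&= 2\left[(\tfrac12 - \psi_1)\,\Gamma_f(\tfrac12) + \bar\Gamma_f(\psi_1) - \bar\Gamma_f(\tfrac12)\right] \\
&= 2\left[\tfrac{V}{4}\,\Gamma_f(\tfrac12) + \bar\Gamma_f(\tfrac12 - \tfrac{V}{4}) - \bar\Gamma_f(\tfrac12)\right]. \tag{129}
\end{align*}

Combining the above with (126) leads to a range of Pinsker style bounds for symmetric $I_f$: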

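Jeffrey's Divergence Since $J(P, Q) = KL(P, Q) + KL(Q, P)$ we have $\gamma(\pi) = \frac{1}{\pi^2(1-\pi)} +
\frac{1}{\pi(1-\pi)^2} = \frac{1}{\pi^2(1-\pi)^2}$. (As a check, $f(t) = (t - 1)\ln(t)$, $f''(t) = \frac{t+1}{t^2}$ and so $\gamma_f(\pi) =
\frac{1}{\pi^3} f''\!\left(\frac{1-\pi}{\pi}\right) = \frac{1}{\pi^2(1-\pi)^2}$.) Thus

\begin{align*}
J(P, Q) &\ge 2\int_{\psi_1}^{1/2} \frac{\pi - \psi_1}{\pi^2(1-\pi)^2}\,d\pi \\
&= (4\psi_1 - 2)(\ln(\psi_1) - \ln(1 - \psi_1)).
\end{align*}

Substituting $\psi_1 = \tfrac12 - \tfrac{V}{4}$ gives

\[ J(P, Q) \ge V \ln\!\left(\frac{2 + V}{2 - V}\right). \]

Observe that the above bound behaves like $V^2$ for small $V$, and $V\ln\!\left(\frac{2+V}{2-V}\right) \ge V^2$
for $V \in [0, 2]$. Using the traditional Pinsker inequality ($KL(P, Q) \ge V^2/2$) we have

\begin{align*}
J(P, Q) &= KL(P, Q) + KL(Q, P) \\
&\ge \frac{V^2}{2} + \frac{V^2}{2} \\
&= V^2.
\end{align*}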
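As a sanity check on the Jeffreys bound, the sketch below compares $J(P,Q)$ with $V\ln\frac{2+V}{2-V}$ on a grid of two-point distributions. The grid and the helper names are assumptions made for illustration; they do not appear in the paper.

```python
import math


def variational(p, q):
    """V(P, Q) = sum_i |p_i - q_i|, which takes values in [0, 2]."""
    return sum(abs(a - b) for a, b in zip(p, q))


def jeffreys(p, q):
    """J(P, Q) = KL(P, Q) + KL(Q, P) for strictly positive discrete distributions."""
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))


def jeffreys_bound(v):
    """Lower bound V ln((2 + V) / (2 - V)), defined for 0 <= V < 2."""
    return v * math.log((2.0 + v) / (2.0 - v))


# Illustrative grid of two-point distributions (an assumption of this sketch).
grid = [k / 20.0 for k in range(1, 20)]
pairs = [((p, 1.0 - p), (q, 1.0 - q)) for p in grid for q in grid]

# Slack J(P, Q) - bound(V(P, Q)); it should never be (meaningfully) negative.
slack = [jeffreys(p, q) - jeffreys_bound(variational(p, q)) for p, q in pairs]
```

On the mirrored pairs $P = (p, 1-p)$, $Q = (1-p, p)$ the slack vanishes, which is consistent with the bound being attained there.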

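Jensen-Shannon Divergence Here $f(t) = \tfrac{t}{2}\ln t - \tfrac{t+1}{2}\ln(t + 1) + \ln 2$ and thus $\gamma_f(\pi) =
\frac{1}{\pi^3} f''\!\left(\frac{1-\pi}{\pi}\right) = \frac{1}{2\pi(1-\pi)}$. Thus

\begin{align*}
JS(P, Q) &\ge 2\int_{\psi_1}^{1/2} \frac{\pi - \psi_1}{2\pi(1 - \pi)}\,d\pi \\
&= \ln(1 - \psi_1) - \psi_1\ln(1 - \psi_1) + \psi_1\ln\psi_1 + \ln(2).
\end{align*}

Substituting $\psi_1 = \tfrac12 - \tfrac{V}{4}$ leads to

\[ JS(P, Q) \ge \left(\tfrac12 - \tfrac{V}{4}\right)\ln(2 - V) + \left(\tfrac12 + \tfrac{V}{4}\right)\ln(2 + V) - \ln(2). \]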
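Hellinger Divergence Here $f(t) = (\sqrt{t} - 1)^2$. Consequently $\gamma_f(\pi) = \frac{1}{\pi^3} f''\!\left(\frac{1-\pi}{\pi}\right) =
\frac{1}{\pi^3}\,\frac{1}{2((1-\pi)/\pi)^{3/2}} = \frac{1}{2[\pi(1-\pi)]^{3/2}}$ and thus

\begin{align*}
h^2(P, Q) &\ge 2\int_{\psi_1}^{1/2} \frac{\pi - \psi_1}{2[\pi(1-\pi)]^{3/2}}\,d\pi \\
&= \frac{4\sqrt{\psi_1}\,(\psi_1 - 1) + 2\sqrt{1 - \psi_1}}{\sqrt{1 - \psi_1}} \\
&= 2 - 4\sqrt{\psi_1(1 - \psi_1)} \\
&= 2 - \sqrt{4 - V^2},
\end{align*}

where the last line follows on substituting $\psi_1 = \tfrac12 - \tfrac{V}{4}$.
For small $V$, $2 - \sqrt{4 - V^2} \approx V^2/4$.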

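Arithmetic-Geometric Mean Divergence Here $f(t) = \frac{t+1}{2}\ln\!\left(\frac{t+1}{2\sqrt{t}}\right)$. Thus $f''(t) = \frac{t^2+1}{4t^2(t+1)}$
and hence $\gamma_f(\pi) = \frac{1}{\pi^3} f''\!\left(\frac{1-\pi}{\pi}\right) = \frac{2\pi^2 - 2\pi + 1}{4\pi^2(1-\pi)^2}$ and thus

\begin{align*}
T(P, Q) &\ge 2\int_{\psi_1}^{1/2} (\pi - \psi_1)\,\frac{2\pi^2 - 2\pi + 1}{4\pi^2(1-\pi)^2}\,d\pi \\
&= -\tfrac12\ln(1 - \psi_1) - \tfrac12\ln(\psi_1) - \ln(2).
\end{align*}

Substituting $\psi_1 = \tfrac12 - \tfrac{V}{4}$ gives

\begin{align*}
T(P, Q) &\ge -\tfrac12\ln\!\left(\tfrac12 + \tfrac{V}{4}\right) - \tfrac12\ln\!\left(\tfrac12 - \tfrac{V}{4}\right) - \ln(2) \\
&= \ln\!\left(\frac{4}{\sqrt{4 - V^2}}\right) - \ln(2).
\end{align*}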

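Symmetric $\chi^2$-Divergence Here $\Psi(P, Q) = \chi^2(P, Q) + \chi^2(Q, P)$ and thus (see below)
$\gamma_f(\pi) = \frac{2}{\pi^3} + \frac{2}{(1-\pi)^3}$. (As a check, from $f(t) = \frac{(t-1)^2(t+1)}{t}$ we have $f''(t) = \frac{2(t^3+1)}{t^3}$
and thus $\gamma_f(\pi) = \frac{1}{\pi^3} f''\!\left(\frac{1-\pi}{\pi}\right)$ gives the same result.)

\begin{align*}
\Psi(P, Q) &\ge 2\int_{\psi_1}^{1/2} (\pi - \psi_1)\left(\frac{2}{\pi^3} + \frac{2}{(1-\pi)^3}\right) d\pi \\
&= \frac{2(1 + 4\psi_1^2 - 4\psi_1)}{\psi_1(1 - \psi_1)}.
\end{align*}

Substituting $\psi_1 = \tfrac12 - \tfrac{V}{4}$ gives $\Psi(P, Q) \ge \frac{8V^2}{4 - V^2}$.

When $\gamma_f$ is not symmetric, one needs to use (127) instead of the simpler (129). We
consider two special cases.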

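$\chi^2$-Divergence Here $f(t) = (t - 1)^2$ and so $f''(t) = 2$ and hence $\gamma(\pi) = f''\!\left(\frac{1-\pi}{\pi}\right)/\pi^3 =
\frac{2}{\pi^3}$ which is not symmetric. Upon substituting $2/\pi^3$ for $\gamma(\pi)$ in (127) and evaluating
the integrals we obtain

\[ \chi^2(P, Q) \ge 2\min_{a \in [-2\psi_1, 2\psi_1]} \underbrace{\left(\frac{1 + 4\psi_1^2 - 4\psi_1}{2\psi_1 - a} - \frac{1 + 4\psi_1^2 - 4\psi_1}{2\psi_1 - a - 2}\right)}_{=:\,J(a, \psi_1)}. \]

One can then solve $\frac{\partial}{\partial a} J(a, \psi_1) = 0$ for $a$ and one obtains $a^* = 2\psi_1 - 1$. Now
$a^* > -2\psi_1$ only if $\psi_1 > \tfrac14$. One can check that when $\psi_1 \le \tfrac14$, then $a \mapsto J(a, \psi_1)$
is monotonically increasing for $a \in [-2\psi_1, 2\psi_1]$ and hence the minimum occurs at
$a^* = -2\psi_1$. Thus the value of $a$ minimising $J(a, \psi_1)$ is

\[ a^* = [\![\psi_1 > 1/4]\!]\,(2\psi_1 - 1) + [\![\psi_1 \le 1/4]\!]\,(-2\psi_1). \]

Substituting the optimal value of $a^*$ into $J(a, \psi_1)$ we obtain

\[ J(a^*, \psi_1) = [\![\psi_1 > 1/4]\!]\,(2 + 8\psi_1^2 - 8\psi_1) + [\![\psi_1 \le 1/4]\!]\left(\frac{1 + 4\psi_1^2 - 4\psi_1}{4\psi_1} - \frac{1 + 4\psi_1^2 - 4\psi_1}{4\psi_1 - 2}\right). \]

Substituting $\psi_1 = \tfrac12 - \tfrac{V}{4}$ and observing that $V < 1 \Rightarrow \psi_1 > 1/4$ we obtain

\[ \chi^2(P, Q) \ge [\![V < 1]\!]\,V^2 + [\![V \ge 1]\!]\,\frac{V}{2 - V}. \]

Observe that the bound diverges to $\infty$ as $V \to 2$.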
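The minimisation over $a$ in the $\chi^2$ case can be reproduced numerically. The sketch below evaluates $J(a, \psi_1)$ on a grid (the grid search and the function names are assumptions for illustration) and compares twice the grid minimum with the closed form $[\![V<1]\!]V^2 + [\![V\ge 1]\!]V/(2-V)$.

```python
def chi2_objective(a, psi1):
    """J(a, psi_1) from the chi-squared derivation above."""
    x = 1.0 + 4.0 * psi1 ** 2 - 4.0 * psi1
    return x / (2.0 * psi1 - a) - x / (2.0 * psi1 - a - 2.0)


def chi2_bound_numeric(v, n=20000):
    """2 * min_a J(a, psi_1) on a grid over [-2 psi_1, 2 psi_1), psi_1 = 1/2 - V/4.

    The right endpoint a = 2 psi_1 is excluded because J is unbounded there.
    Intended for 0 < V < 2.
    """
    psi1 = 0.5 - v / 4.0
    lo, hi = -2.0 * psi1, 2.0 * psi1
    return 2.0 * min(chi2_objective(lo + (hi - lo) * k / n, psi1) for k in range(n))


def chi2_bound_closed(v):
    """Closed form: V^2 for V < 1 and V / (2 - V) for V >= 1."""
    return v * v if v < 1.0 else v / (2.0 - v)
```

For example, comparing `chi2_bound_numeric(v)` with `chi2_bound_closed(v)` at a few values of `v` in (0, 2) should show agreement up to the grid resolution.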

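Kullback-Leibler Divergence In this case we have $f(t) = t\ln t$ and thus $f''(t) = 1/t$
and $\gamma_f(\pi) = \frac{1}{\pi^3} f''\!\left(\frac{1-\pi}{\pi}\right) = \frac{1}{\pi^2(1-\pi)}$ which is clearly not symmetric. From (127) we
obtain

\[ KL(P, Q) \ge \min_{a \in [-2\psi_1, 2\psi_1]} \left(1 - \frac{a}{2} - \psi_1\right)\ln\!\left(\frac{a + 2\psi_1 - 2}{a - 2\psi_1}\right) + \left(\frac{a}{2} + \psi_1\right)\ln\!\left(\frac{a + 2\psi_1}{a - 2\psi_1 + 2}\right). \]

Substituting $\psi_1 = \tfrac12 - \tfrac{V}{4}$ gives $KL(P, Q) \ge \min_{a \in [\frac{V-2}{2}, \frac{2-V}{2}]} \delta_a(V)$, where

\[ \delta_a(V) = \left(\frac{V + 2 - 2a}{4}\right)\ln\!\left(\frac{2a - 2 - V}{2a - 2 + V}\right) + \left(\frac{2a + 2 - V}{4}\right)\ln\!\left(\frac{2a + 2 - V}{2a + 2 + V}\right). \]

Set $\beta := 2a$ and we have (95).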
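The one-dimensional search in the Kullback-Leibler case is easy to carry out numerically. The sketch below is an illustration only (the grid search and names are assumptions): it minimises the objective in $\beta = 2a$ over the open interval $(V-2, 2-V)$ and can be compared with the polynomial bound $V^2/2 + V^4/36$ and with the value at $\beta = 0$, namely $\tfrac{V}{2}\ln\frac{2+V}{2-V}$.

```python
import math


def delta(beta, v):
    """The objective delta_a(V) written in terms of beta = 2a."""
    first = (v + 2.0 - beta) / 4.0 * math.log((beta - 2.0 - v) / (beta - 2.0 + v))
    second = (beta + 2.0 - v) / 4.0 * math.log((beta + 2.0 - v) / (beta + 2.0 + v))
    return first + second


def kl_bound_numeric(v, n=20001):
    """Grid minimum of delta(beta, V) over the open interval (V - 2, 2 - V).

    Cell midpoints are used so the endpoints, where a logarithm is singular,
    are never evaluated. With n odd, beta = 0 lies on the grid (up to rounding).
    Intended for 0 < V < 2.
    """
    lo, hi = v - 2.0, 2.0 - v
    return min(delta(lo + (hi - lo) * (k + 0.5) / n, v) for k in range(n))


def symmetric_value(v):
    """Value of the objective at beta = 0: (V / 2) ln((2 + V) / (2 - V))."""
    return 0.5 * v * math.log((2.0 + v) / (2.0 - v))
```

Because a grid minimum can only overestimate the true minimum, the numerical value is an upper estimate of the tight bound; refining `n` tightens it.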
Appendix B. Summary of Previous "Monistic" Approaches to
Unification

There is a range of different approaches to unifying machine learning from a monistic
perspective:

Low level data interchange: There is a small amount of work on developing standards 
for interchanging data sets (Grossman et al., 2002; Carey et al., 2007; Wettschereck and 
Muller, 2001) — this is analogous to PDDL (Ghallab et al., 1998). There are also some 
limited higher level attempts such as ontologies (Soldatova and King, 2006) and general 
frameworks (Fayyad et al., 1996). 

Modelling frameworks: To solve a machine learning problem, one needs models. 
There is a rich literature on graphical models (Jordan, 1999), factor graphs (Kschischang
et al., 2001) and Markov logic networks (Domingos and Richardson, 2004; Richardson 
and Domingos, 2006) which have allowed the unification of sets of problems (Worthen and 
Stark, 2001), with a focus on the modelling and computational techniques for particular 
problems. 

Comparison of frameworks: There are several philosophical frameworks/approaches 
to designing inference and learning algorithms. There are several works (Barnett, 1999; 
Bayarri and Berger, 2004; Berger, 2003) that compare and contrast these. They are 
effectively comparing different monistic frameworks, not comparing problems. 

Overarching frameworks: These include Bayesian (Robert, 1994), information-theoretic 
(Jenssen, 2005a; Harremoës, 1993), game-theoretic (Vovk et al., 2005; Grünwald and
Dawid, 2004), MDL (Grünwald, 2007; Rissanen, 2007), regularised distance minimisation
(Borwein and Lewis, 1991; Altun and Smola, 2006; Broniatowski, 2004), and more
narrowly focussed "unifying frameworks" such as information geometry (Dawid, 2007; 
Eguchi, 2005), exponential families (Canu and Smola, 2006) and the information bottle- 
neck (Tishby et al., 2000). 

Appendix C. Examples and Prior Work on Surrogate Loss Bounds 

Surrogate loss bounds have garnered increasing interest in the machine learning commu- 
nity (Zhang, 2004b; Bartlett et al., 2006; Steinwart, 2007; Steinwart and Christmann, 
2008). Steinwart and Christmann (2008, Chapter 3) have presented a good summary of 
recent work. 

All of the recent work has been in terms of margin losses of the form

\[ L_\phi(\eta, h) = \eta\,\phi(h) + (1 - \eta)\,\phi(-h). \]

As Buja et al. (2005) discuss, such margin losses cannot capture the richness of all
possible proper scoring rules. Bartlett et al. (2006) prove that for any $h$

\[ \psi\bigl(L^{0-1}(\eta, h) - L^{0-1}(\eta)\bigr) \le L_\phi(\eta, h) - L_\phi(\eta) \]

where $\psi = \tilde\psi^{**}$ is the LF biconjugate of $\tilde\psi$,

\[ \tilde\psi(\theta) = H^-\!\left(\frac{1 + \theta}{2}\right) - H\!\left(\frac{1 + \theta}{2}\right), \]

$H(\eta) = L_\phi(\eta)$ and

\[ H^-(\eta) = \inf_{\alpha\,:\,\alpha(2\eta - 1) \le 0} \bigl(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\bigr) \]

is the optimal conditional risk under the constraint that the sign of the argument $\alpha$
disagrees with $2\eta - 1$.

We will consider two examples presented by Bartlett et al. (2006) and show that the 
bounds we obtain with the above theorem match the results we obtain with Theorem 26. 

Exponential Loss Consider the link $h = \psi(\hat\eta) = \tfrac12\log\frac{\hat\eta}{1 - \hat\eta}$ with corresponding inverse
link $\hat\eta = \frac{1}{1 + e^{-2h}}$. Buja et al. (2005) showed that this link function combined with
exponential margin loss $\phi(\gamma) = e^{-\gamma}$ results in a proper scoring rule.

From (44) we obtain

\[ W(\eta) = \frac{1}{2[\eta(1 - \eta)]^{3/2}}. \]

(Note Buja et al. (2005) have missed the factor of $\tfrac12$.) Thus $\overline{W}(\eta) = \frac{2\eta - 1}{\sqrt{\eta(1 - \eta)}}$ and
$\overline{\overline{W}}(\eta) = -2\sqrt{\eta(1 - \eta)}$. Hence from (53) we obtain

\[ L(\eta) = 2\sqrt{\eta(1 - \eta)} \tag{130} \]

and from (74) we obtain that if $B_{1/2}(\eta, \hat\eta) = \alpha$ then

\[ B(\eta, \hat\eta) \ge 1 - \sqrt{1 - 4\alpha^2}. \tag{131} \]

Equations 130 and 131 match the results presented by Bartlett et al. (2006) upon
noting that $B_{1/2}(\eta, \hat\eta)$ measures the loss in terms of $\ell_{1/2}$ and Bartlett et al. (2006)
used $\ell^{0-1} = 2\ell_{1/2}$.

Truncated Quadratic Loss Consider the margin loss $\phi(h) = ((1 + h) \vee 0)^2 = (2\hat\eta \vee 0)^2$
with link function $\psi(\hat\eta) = 2\hat\eta - 1$. From (44) we obtain $L(\eta) = 4\eta(1 - \eta)$ and
from (74) the regret bound $B(\eta, \hat\eta) \ge 4\alpha^2$. These match the results presented by
Bartlett et al. (2006) when again it is noted we used $\ell_{1/2}$ and they used $\ell^{0-1}$.

The above results are for $c_0 = \tfrac12$. Generalisations of margin losses to the case of uneven
weights are presented by Steinwart and Christmann (2008, Section 3.5). Nevertheless,
since the same $\phi$ function is still used for both components of the loss (albeit with
unequal weights) such a scheme still cannot capture the full generality of all proper
scoring rules in the manner achieved by the results in Section 7.1.
Appendix D. A Brief History of Pinsker Inequalities

Pinsker (1964) presented the first bound relating $KL(P, Q)$ to $V(P, Q)$: $KL \ge V^2/2$ and
it is now known by his name or sometimes as the Pinsker-Csiszár-Kullback inequality
since Csiszár (1967) presented another version and Kullback (1967) showed $KL \ge V^2/2 +
V^4/36$. Much later Topsøe (2001) showed $KL \ge V^2/2 + V^4/36 + V^6/270$. Non-polynomial
bounds are due to Vajda (1970): $KL \ge L_{\mathrm{Vajda}}(V) := \log\!\left(\frac{2+V}{2-V}\right) - \frac{2V}{2+V}$ and Toussaint
(1978) who showed $KL \ge L_{\mathrm{Vajda}}(V) \vee (V^2/2 + V^4/36 + V^8/288)$.
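The bounds listed above are simple to compare numerically. The following sketch (function names are illustrative assumptions) implements each bound as a function of $V \in [0, 2)$ and, as a reference point, the Kullback-Leibler divergence between the mirrored two-point distributions $(p, 1-p)$ and $(1-p, p)$, which equals $\tfrac{V}{2}\ln\frac{2+V}{2-V}$ when $V = 2|2p - 1|$.

```python
import math


def pinsker(v):
    return v ** 2 / 2.0


def kullback(v):
    return v ** 2 / 2.0 + v ** 4 / 36.0


def topsoe(v):
    return v ** 2 / 2.0 + v ** 4 / 36.0 + v ** 6 / 270.0


def vajda(v):
    return math.log((2.0 + v) / (2.0 - v)) - 2.0 * v / (2.0 + v)


def toussaint(v):
    return max(vajda(v), v ** 2 / 2.0 + v ** 4 / 36.0 + v ** 8 / 288.0)


def mirrored_binary_kl(v):
    """KL((p, 1-p), (1-p, p)) expressed through V = 2|2p - 1|."""
    return 0.5 * v * math.log((2.0 + v) / (2.0 - v))


ALL_BOUNDS = (pinsker, kullback, topsoe, vajda, toussaint)
```

Since every function in `ALL_BOUNDS` is a lower bound on KL for all pairs with the given $V$, each should lie below `mirrored_binary_kl(v)`; the polynomial bounds stay finite as $V \to 2$ whereas Vajda's bound diverges.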

Care needs to be taken when comparing results from the literature as different definitions
for the divergences exist. For example Gibbs and Su (2002) use a definition of
$V$ that differs by a factor of 2 from ours. There are some isolated bounds relating $V$ to
some other divergences, analogous to the classical Pinsker bound; Kumar and Chhina
(2005) have presented a summary as well as new bounds for a wide range of symmetric
f-divergences by making assumptions on the likelihood ratio: $r \le p(x)/q(x) \le R < \infty$
for all $x \in \mathcal{X}$. This line of reasoning has also been developed by Dragomir et al. (2001);
Taneja (2005a, b). Topsøe (2000) has presented some infinite series representations for
capacitory discrimination in terms of triangular discrimination which lead to inequalities
between those two divergences. Liese and Miescke (2008, p. 48) give the inequality
$V \le h\sqrt{4 - h^2}$ (which seems to be originally due to LeCam (1986)) which when rearranged
corresponds exactly to the bound for $h^2$ in Theorem 28. Withers (1999) has also
presented some inequalities between other (particular) pairs of divergences; his reasoning
is also in terms of infinite series expansions.

Unterreiter et al. (2000) considered the case of $n = 1$ but arbitrary $I_f$ (that is they
bound an arbitrary f-divergence in terms of the variational divergence). Their argument
is similar to the geometric proof of Theorem 27. They do not compute any of the explicit
bounds in Theorem 28 except they state (page 243) $\chi^2(P, Q) \ge V^2$ which is looser than
(94).

Gilardoni (2006a) showed (via an intricate argument) that if $f'''(1)$ exists, then $I_f \ge
\frac{f''(1)V^2}{2}$. He also showed some fourth order inequalities of the form $I_f \ge c_{2,f}V^2 + c_{4,f}V^4$
where the constants depend on the behaviour of $f$ at 1 in a complex way. Gilardoni
(2006b, c) presented a completely different approach which obtains many of the results
of Theorem 28.$^{31}$ Gilardoni (2006c) improved Vajda's bound slightly to

\[ KL(P, Q) \ge \ln\frac{2}{2 - V} - \frac{2 - V}{2}\ln\frac{2 + V}{2}. \]

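Gilardoni (2006b, c) presented a general tight lower bound for $I_f(P, Q)$ in terms of
$V(P, Q)$ which is difficult to evaluate explicitly in general:

\[ I_f \ge \frac{V}{2}\left(\frac{f[g_R^{-1}(k(1/V))]}{g_R^{-1}(k(1/V)) - 1} + \frac{f[g_L^{-1}(k(1/V))]}{1 - g_L^{-1}(k(1/V))}\right), \]

where $k^{-1}(t) = \frac12\left[\frac{1}{1 - g_L^{-1}(t)} + \frac{1}{g_R^{-1}(t) - 1}\right]$, $k(u) = (k^{-1})^{-1}(u)$, $g(u) = (u - 1)f'(u) - f(u)$,
$g_R^{-1}[g(u)] = u$ for $u \ge 1$ and $g_L^{-1}[g(u)] = u$ for $u \le 1$. He presented a new parametric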

31. We were unaware of these two papers until completing the results presented in the main paper. 
form for $I_f = KL$ in terms of Lambert's W function. In general, the result is analogous
to that of Fedotov et al. (2003) in that it is in a parametric form which, if one wishes
to evaluate for a particular $V$, one needs to do a one dimensional numerical search,
as complex as (95). However, when $f$ is such that $I_f$ is symmetric, this simplifies to
the elegant form $I_f \ge \frac{2 - V}{2}\, f\!\left(\frac{2 + V}{2 - V}\right) - f'(1)V$. He presented explicit special cases for $h^2$,
$J$, $\Delta$ and $I$ identical to the results in Theorem 28. It is not apparent how the approach
of Gilardoni (2006b, c) could be extended to more general situations such as that in
Theorem 27 (i.e. $n > 1$).

Finally Bolley and Villani (2005) have considered weighted versions of the Pinsker 
inequalities (bounds for a weighted generalisation of Variational divergence) in terms of 
KL-divergence that are related to transportation inequalities. 

Appendix E. Examples of extended convolution factorisation 

In this section we present three examples of $f$ which can be written as $f = g \,\Box\, g$.

If $g(t) = (t - 1)^2$ (corresponding to Pearson $\chi^2$ divergence), $(g \,\Box\, g)(\tau) = \inf_{x\in\mathbb{R}_+} (x -
1)^2 + \tau(x/\tau - 1)^2$. Differentiating the right-hand side with respect to $x$, setting to zero
and solving for $x$ gives $x = \frac{2}{1 + 1/\tau}$. Substituting we obtain $(g \,\Box\, g)(\tau) = \frac{(\tau - 1)^2}{\tau + 1}$ which is
the $f$ for $\Delta(P, Q)$, the triangular discrimination.

If $g(t) = t\ln(t)$, a similar straightforward calculation yields $(g \,\Box\, g)(\tau) = -\frac{2}{e}\sqrt{\tau}$.

If $g(t) = (\sqrt{t} - 1)^2$ (corresponding to Hellinger divergence) then a similar calculation
yields $(g \,\Box\, g)(\tau) = \tfrac12(\sqrt{\tau} - 1)^2 = g(\tau)/2$. Thus this $g$ plays a role analogous to a gaussian
kernel in ordinary convolution. The significance of this is unclear.
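The three closed forms above can be verified by direct one-dimensional minimisation. The sketch below is illustrative only: the ternary search, its bracketing interval and the names are assumptions, relying on the fact that $x \mapsto g(x) + \tau g(x/\tau)$ is convex for convex $g$.

```python
import math


def extended_self_convolution(g, tau, lo=1e-9, hi=50.0, iterations=200):
    """Approximate (g box g)(tau) = inf_{x > 0} g(x) + tau * g(x / tau).

    Uses ternary search on [lo, hi]; the bracket is an assumption that is
    adequate for the moderate values of tau used in this sketch.
    """
    def objective(x):
        return g(x) + tau * g(x / tau)

    for _ in range(iterations):
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if objective(left) < objective(right):
            hi = right
        else:
            lo = left
    return objective(0.5 * (lo + hi))


# Each entry pairs g with the closed form of (g box g) stated in the text.
EXAMPLES = {
    "pearson": (lambda t: (t - 1.0) ** 2,
                lambda tau: (tau - 1.0) ** 2 / (tau + 1.0)),
    "kl": (lambda t: t * math.log(t),
           lambda tau: -2.0 * math.sqrt(tau) / math.e),
    "hellinger": (lambda t: (math.sqrt(t) - 1.0) ** 2,
                  lambda tau: 0.5 * (math.sqrt(tau) - 1.0) ** 2),
}
```

Evaluating `extended_self_convolution(g, tau)` against the paired closed form for a few values of `tau` gives a quick check of the algebra.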

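We summarise the results (and the associated $g^*$) in the following table.

$g(t)$                 $(g \,\Box\, g)(\tau)$                  $g^*(s)$
$(t - 1)^2$            $\frac{(\tau - 1)^2}{\tau + 1}$         $\frac{s^2}{4} + s$
$t\ln t$               $-\frac{2}{e}\sqrt{\tau}$               $e^{s - 1}$
$(\sqrt{t} - 1)^2$     $\tfrac12(\sqrt{\tau} - 1)^2$           $\frac{s}{1 - s}[\![s < 1]\!] + \infty[\![s \ge 1]\!]$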



Whilst it is indeed straightforward to compute $(g \,\Box\, g)$ given $g$ (although a simple
closed form is not always possible), it is far from obvious how to go from a given $f$ to a
$g$ such that $f = g \,\Box\, g$.

Hiriart-Urruty and Lemaréchal (1993a, page 69) show that for $f$ convex on $\mathbb{R}_+$, $g$
convex and increasing on $\mathbb{R}_+$,

\[ (g \circ f)^*(s) = \inf_{\alpha \ge 0} \alpha f^*(s/\alpha) + g^*(\alpha). \]

This illuminates the difficulty of the above "factorisation problem". It is equivalent to:
given a convex increasing $f^*$, find a convex increasing $g^*$ such that $f^* = g^* \circ g^*$.
Appendix F. Empirical Estimators of $V_{B_{\mathcal{H}},\pi}(P, Q)$ and SVMs

This appendix further develops the observations made in Section 8.1.1 regarding the relationship
between divergence and risk when $\mathcal{R} = B_{\mathcal{H}}$, a unit ball in a reproducing kernel
Hilbert space $\mathcal{H}$. In contrast to the rest of the paper (which focussed on relationships
involving the underlying distributions), in this appendix we will consider the practical 
situation where there is only an empirical sample. We will see how the general results 
have interesting implications for sample based machine learning algorithms. 

If we require an empirical estimate of $V_{\mathcal{R},\pi}(P, Q)$ we can replace $P$ and $Q$ by empirical
distributions. We will use weighted empirical distributions. Given an independent identically
distributed sample $w = (w_1, \dots, w_m) \in \mathcal{X}^m$ the $\alpha$-weighted empirical distribution
$P_w^\alpha$ with respect to $w$ is defined by

\[ dP_w^\alpha := \sum_{i=1}^m \alpha_i\,\delta(\cdot - w_i) \]

where $\alpha = (\alpha_1, \dots, \alpha_m)$, $\alpha_i \ge 0$, $i = 1, \dots, m$ and $\sum_{i=1}^m \alpha_i = 1$. We will write
$\mathbb{E}_{P_w^\alpha}\phi = \sum_{i=1}^m \alpha_i\,\phi(w_i)$. Thus

vldPz,P^) = \\\^-^\\li- 

Suppose now that $P$ and $Q$ correspond to the positive and negative class conditional
distributions. Let $x := (x_1, \dots, x_m)$ be a sample drawn from $M = \pi P + (1 - \pi)Q$ with
corresponding label vector $y = (y_1, \dots, y_m)$. Let $I := \{1, \dots, m\}$, $I^+ := \{i \in I : y_i = 1\}$,
$I^- := \{i \in I : y_i = -1\}$. Consider a weight vector $\alpha = (\alpha_1, \dots, \alpha_m)$ over the whole
sample. Thus

\[ \mathbb{E}_P\phi = \sum_{i\in I^+} \alpha_i\,\phi(x_i) \quad\text{and}\quad \mathbb{E}_Q\phi = \sum_{i\in I^-} \alpha_i\,\phi(x_i) \]

where we also require

\[ \sum_{i\in I^+} \alpha_i = \frac{m^+}{m} \quad\text{and}\quad \sum_{i\in I^-} \alpha_i = \frac{m^-}{m} \tag{132} \]

and hence

\[ \sum_{i\in I} \alpha_i y_i = \frac{m^+ - m^-}{m}. \]
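Substituting into (106) we have

\begin{align*}
&\left\langle \sum_{i\in I^+} \alpha_i\phi(x_i) - \sum_{i\in I^-} \alpha_i\phi(x_i),\ \sum_{j\in I^+} \alpha_j\phi(x_j) - \sum_{j\in I^-} \alpha_j\phi(x_j) \right\rangle \\
&\quad= \left\langle \sum_{i\in I} \alpha_i y_i\phi(x_i),\ \sum_{j\in I} \alpha_j y_j\phi(x_j) \right\rangle \\
&\quad= \sum_{i\in I}\sum_{j\in I} \alpha_i\alpha_j y_i y_j \langle\phi(x_i), \phi(x_j)\rangle \\
&\quad= \sum_{i\in I}\sum_{j\in I} \alpha_i\alpha_j y_i y_j\,k(x_i, x_j) =: J(\alpha, x). \tag{133}
\end{align*}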

We now consider three different choices of a. 

Uniform weighting If we set $\alpha_i = \frac{1}{m}$, $i = 1, \dots, m$, then (133) becomes

\[ \frac{1}{m^2}\sum_{i,j\in I} y_i y_j\,k(x_i, x_j) = \mathrm{MMD}_b^2[B_{\mathcal{H}}, x^+, x^-] \]

where $x^+ := (x_i)_{i\in I^+}$, $x^- := (x_i)_{i\in I^-}$ and $\mathrm{MMD}_b$ is the biased estimator of the Maximum
Mean Discrepancy (Gretton et al., 2008), an alternate name for $V_{\mathcal{F}}$. Observe that from
Theorem 30, this case corresponds to using a Fisher linear discriminant in feature space
(Devroye et al., 1996) when it is assumed that the within-class covariance matrices are
both the identity matrix. This follows by observing that the constructed hypothesis is
identical in both cases.

Pessimistic Weighting Instead of weighting each sample equally, one can optimise
over $\alpha$. By Theorem 30, minimizing $J(\alpha, x)$ over $\alpha$ will maximize $L^{\mathrm{lin}}$ and is thus the
most pessimistic choice. Explicitly, we have

\begin{align*}
\min_{\alpha}\ & \sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j\,k(x_i, x_j) \tag{134} \\
\text{s.t. } & \alpha_i \ge 0, \quad i = 1, \dots, m \tag{135} \\
& \sum_{i=1}^m \alpha_i y_i = \frac{m^+ - m^-}{m} \tag{136} \\
& \sum_{i=1}^m \alpha_i = 1 \tag{137}
\end{align*}

which can be recognized as the support vector machine (Cortes and Vapnik, 1995). The
SVM uses the sign of the "witness" (Gretton et al., 2008), $x \mapsto \sum_{i=1}^m \alpha_i y_i\,k(x_i, x)$ as its
predictor.
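The quantities in (133) and the witness function can be computed directly from a labelled sample. The sketch below is a minimal illustration under stated assumptions: the linear kernel, the toy sample and the function names are hypothetical and chosen only so that $J(\alpha, x)$ can be checked against the squared norm of the weighted, signed mean of the data. It does not solve the optimisation problem (134)-(137).

```python
def linear_kernel(u, v):
    """k(u, v) = <u, v>; an assumed kernel for this sketch."""
    return sum(a * b for a, b in zip(u, v))


def j_objective(alpha, xs, ys, kernel=linear_kernel):
    """J(alpha, x) = sum_i sum_j alpha_i alpha_j y_i y_j k(x_i, x_j), as in (133)."""
    m = len(xs)
    return sum(
        alpha[i] * alpha[j] * ys[i] * ys[j] * kernel(xs[i], xs[j])
        for i in range(m)
        for j in range(m)
    )


def witness(alpha, xs, ys, x, kernel=linear_kernel):
    """Witness function x -> sum_i alpha_i y_i k(x_i, x); its sign is the prediction."""
    return sum(a * y * kernel(xi, x) for a, y, xi in zip(alpha, ys, xs))


# Toy sample with uniform weights alpha_i = 1/m (all values are illustrative).
xs = [(0.0, 1.0), (1.0, 2.0), (3.0, 0.5), (4.0, 1.5)]
ys = [1, 1, -1, -1]
alpha = [0.25] * 4

# For the linear kernel, J(alpha, x) is the squared norm of sum_i alpha_i y_i x_i.
signed_mean = tuple(
    sum(a * y * x[d] for a, y, x in zip(alpha, ys, xs)) for d in range(2)
)
squared_norm = sum(c * c for c in signed_mean)
```

With a non-linear kernel the same `j_objective` applies unchanged; only the identification with an explicit squared norm in input space is specific to the linear case.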






Interpolation between above two cases A parametrized interpolation between
the above two cases can be constructed by the addition of the constraints

\[ \alpha_i \le \frac{1}{\nu m}, \quad i = 1, \dots, m, \tag{138} \]

where $\nu \in (0, 1]$ is an adjustable parameter. Observe that $\nu$ controls the sparsity of $\alpha$
since (138), (135) and (137) together imply that $|\{i \in I : \alpha_i \ne 0\}| \ge \nu m$. Crisp and
Burges (2000) have shown that (134), ..., (138) is equivalent to the $\nu$-SVM algorithm
(Schölkopf et al., 2000).

While "information-theoretic" approaches to the SVM and weighted kernel represen- 
tations are hardly new 32 , the results presented here are novel and provide a simple and 
direct derivation of the SVM via the generalised variational divergence. 

If $V_{B_{\mathcal{H}},\pi}(P_w, Q_z)$ is used as a test statistic to infer whether two samples $w$ and
z are drawn from the same distribution (as Gretton et al. (2008) do), then when the 
distributions from which w and z are drawn are close, the classification performance of 
the corresponding classifier (i.e. the classifier that uses the sign of the witness function) 
will be close to the worst possible. Thus one will be operating in a regime distinct from 
the normal situation, where the risk is typically small. 

Finally observe that the derivation of the SVM presented here could be viewed as 
an application of an alternate "inductive principle" — a general recipe for constructing 
learning algorithms from learning task specification (Vapnik, 1989, 2006). The tradi- 
tional Empirical Risk Minimization principle entails replacing (P,Q) with (P x +, Q x -) 
in the definition of L(7r, P, Q). Then, in order to not overfit, one restricts the class of 
functions from which hypotheses are drawn. That is, there are two approximations: 

\[ L(\pi, P, Q) \xrightarrow{\text{Empirical Approximation (uniform)}} L(\pi, P_{x^+}, Q_{x^-}) \xrightarrow{\text{Restrict Class}} L_{\mathcal{R}}(\pi, P_{x^+}, Q_{x^-}). \]

32. The use of kernel representations for classification is of course not new: from the classical kernel
classifier (where $\alpha_i = 1/m$ for all $i \in I$) (Devroye et al., 1996, Chapter 10) to the Generalised
Portrait (Aizerman et al., 1964), the Generalised Discriminant (Baudat and Anouar, 2000) and the
panoply of techniques inspired by Support Vector Machines (Schölkopf and Smola, 2002; Herbrich,
2002). None of these techniques is designed from the perspective of minimising an f-divergence.

Principe et al. (2000a) have developed an approach to machine learning problems based on 
information theoretic criteria (Principe et al., 2000b; Jenssen et al., 2004; Xu et al., 2005; Jenssen, 
2005b; Jenssen et al., 2006; Pavia et al., 2006). Jenssen et al. (2004, 2006) considered kernel methods 
from the perspective of Rényi's quadratic entropy. They do not exploit the formal relationship
between maximising divergence and minimising risk. They interpret the SVM as being constructed 
from weighted Parzen windows density estimates. Gretton et al. (2008) explained the relationship 
between their MMD estimators and those derived from (unweighted) Parzen windows estimates of 
the class-conditional distributions. Weighted Parzen windows estimates were used as a basis for 
building a classifier by Babich and Camps (1996). Weighted empirical distributions are widely used 
in particle filtering (Crisan and Doucet, 2002). 

McDermott and Katagiri (2002) considered the direct optimisation of a classifier built on top of 
Parzen windows density estimates. They showed that the minimum classification error criterion is 
equivalent to a Parzen windows estimate of the theoretical Bayes risk. They re-derive the traditional 
approach of minimising an estimate of the expected loss. McDermott and Katagiri (2003) extended 
their approach to the multi-class setting in a way that takes account of all the "other" classes better 
in estimating the probability of error of a given class. 
Upon setting $\alpha^+ = (\alpha_i)_{i \in I^+}$ and $\alpha^- = (\alpha_i)_{i \in I^-}$, the derivation presented above, in
contrast, can be summarised schematically by

$$\text{``}L\text{''}(\pi, P, Q) \xrightarrow{\text{Restrict Class}} L_{\mathcal{F}}(\pi, P, Q) \xrightarrow{\text{Empirical Approximation ($\alpha$-weighted)}} L_{\mathcal{F}}(\pi, P_{x^+}, Q_{x^-});$$

where a different loss (the "linear" loss) was used at the start. With that loss function,
reversing the order of the two approximations would not work, so the derivation is not
equivalent to the ERM inductive principle. The first step makes $L$ well defined (with no
restriction on the function class it is not, hence the quotes) and will avoid overfitting
in any case. The second step is the more general ($\alpha$-weighted) empirical approximation.
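
Why the restriction must come first can be seen in a finite-dimensional toy setting. The sketch below is illustrative only and makes several assumptions not taken from the text: hypotheses are linear functions $f(x) = \langle w, x \rangle$ on $\mathbb{R}^d$, the "linear" objective is the difference of ($\alpha$-weighted) sample means of $f$, and the restricted class is the unit ball $\|w\| \le 1$. All function names are hypothetical.

```python
import numpy as np

def linear_objective(w, x_pos, x_neg, alpha_pos=None, alpha_neg=None):
    """Difference of (alpha-weighted) sample means of f(x) = <w, x>.

    With weights set to None, the uniform empirical distribution is used.
    """
    mu_pos = np.average(x_pos, axis=0, weights=alpha_pos)
    mu_neg = np.average(x_neg, axis=0, weights=alpha_neg)
    return float(w @ (mu_pos - mu_neg))

def restricted_supremum(x_pos, x_neg, alpha_pos=None, alpha_neg=None):
    """Supremum of the objective over the unit ball ||w|| <= 1.

    By Cauchy-Schwarz, it equals the norm of the mean difference and is
    attained at the normalised mean difference.
    """
    mu_pos = np.average(x_pos, axis=0, weights=alpha_pos)
    mu_neg = np.average(x_neg, axis=0, weights=alpha_neg)
    diff = mu_pos - mu_neg
    norm = float(np.linalg.norm(diff))
    return norm, diff / norm

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x_pos = rng.normal(1.0, 1.0, (50, 2))
    x_neg = rng.normal(-1.0, 1.0, (50, 2))
    sup_val, w_star = restricted_supremum(x_pos, x_neg)
    # Without a norm constraint the objective grows linearly with the scale of w.
    for scale in (1.0, 10.0, 100.0):
        print(scale, linear_objective(scale * w_star, x_pos, x_neg))
    print("restricted supremum:", sup_val)
```

In this toy setting the unrestricted objective is unbounded above, since scaling $w$ scales its value, so an optimum exists only once the class is restricted. The weighted and uniform cases differ only in the means that are compared.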